Measuring Progress Toward AGI: A Cognitive Framework
Ryan Burnell, Yumeya Yamamori, Orhan Firat, Kate Olszewska, Steph Hughes-Fitt, Oran Kelly, Isaac R. Galatzer-Levy, Meredith Ringel Morris
Abstract
Despite widespread discussion of AGI, there is no clear framework for measuring progress toward it. This ambiguity fuels subjective claims, makes it difficult to track progress, and risks hindering responsible governance. As a starting point to address this gap, we present a framework for understanding system capabilities in relation to human cognitive abilities. Drawing from decades of research in psychology, neuroscience, and cognitive science, we introduce a Cognitive Taxonomy that deconstructs general intelligence into 10 key cognitive faculties. We then propose a rigorous evaluation protocol in which a system's performance is measured across a suite of targeted, held-out cognitive tasks, generating a 'cognitive profile' that can be used to understand a system's strengths and weaknesses. We hope this framework will provide a practical roadmap and an initial step toward more rigorous, empirical evaluation of AGI.
AI Impact Assessments
Scientific Impact Assessment: "Measuring Progress Toward AGI: A Cognitive Framework"
1. Core Contribution
This paper proposes a Cognitive Taxonomy comprising 10 cognitive faculties (Perception, Generation, Attention, Learning, Memory, Reasoning, Metacognition, Executive Functions, Problem Solving, and Social Cognition) as a framework for measuring AI system capabilities relative to human cognition. The taxonomy is grounded in established cognitive science literature, and the paper outlines a three-stage evaluation protocol: (1) cognitive assessment across targeted tasks, (2) collection of human baselines, and (3) construction of "cognitive profiles" mapping system strengths and weaknesses against human population distributions.
The core problem addressed is legitimate and important: the field lacks a standardized, comprehensive framework for measuring how AI capabilities compare to human cognition across its full breadth. The proposed solution is essentially a detailed organizational schema paired with methodological guidelines for evaluation.
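The third stage of the protocol, building a cognitive profile against human population distributions, can be illustrated with a minimal sketch. The assessment does not say how the paper scores or aggregates tasks, so everything here is an assumption made for illustration: the function names, the choice of a percentile rank as the comparison statistic, the averaging of task scores within a faculty, and the example numbers, which are invented and are not results.

```python
from bisect import bisect_right
from statistics import mean


def percentile_rank(score, human_scores):
    """Percentage of human baseline scores at or below the given score."""
    ordered = sorted(human_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


def cognitive_profile(system_scores, human_baselines):
    """Map each faculty to the system's percentile rank among humans.

    system_scores:   faculty -> list of the system's task scores (stage 1)
    human_baselines: faculty -> list of human scores on the same tasks (stage 2)
    Averaging task scores within a faculty is an assumption of this sketch.
    """
    return {
        faculty: percentile_rank(mean(task_scores), human_baselines[faculty])
        for faculty, task_scores in system_scores.items()
    }


# Hypothetical data for two of the ten faculties; values are made up.
system_scores = {
    "Memory": [0.9, 0.8],
    "Social Cognition": [0.4, 0.5],
}
human_baselines = {
    "Memory": [0.5, 0.6, 0.7, 0.8, 0.9],
    "Social Cognition": [0.5, 0.6, 0.7, 0.8, 0.9],
}

profile = cognitive_profile(system_scores, human_baselines)
print(profile)
```

A profile of this shape makes uneven capability visible: in the invented data, the system sits high in the human distribution for Memory and at the bottom for Social Cognition, which a single aggregate score would hide.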
2. Methodological Rigor
This paper is primarily a position/framework paper rather than an empirical contribution. As such, it should be evaluated on the coherence and completeness of its framework rather than experimental rigor.
Strengths in rigor:
Weaknesses in rigor:
3. Potential Impact
Positive impact potential:
Limitations on impact:
4. Timeliness & Relevance
The paper is highly timely. The rapid advancement of frontier AI systems (GPT-4, Gemini, Claude, etc.) has created an urgent need for rigorous capability assessment frameworks. Policymakers worldwide are developing AI governance frameworks and need grounding for terms like "AGI." The paper directly addresses a current bottleneck in responsible AI development.
However, the paper arrives in a field already populated with related efforts: the Levels of AGI framework (Morris et al., 2024, from the same group), Chollet's ARC benchmark, Humanity's Last Exam, and various cognitive evaluation proposals. The incremental advance over the group's own prior work is a more detailed cognitive decomposition, but the lack of empirical implementation limits its distinctiveness.
5. Strengths & Limitations
Key Strengths:
Key Limitations:
Additional Observations
The paper reads more as a research agenda or roadmap than a completed scientific contribution. Its value lies primarily in organizing thinking and establishing shared vocabulary. The author list from Google DeepMind gives it institutional weight that may drive adoption regardless of scientific novelty. The extensive appendix taxonomy, while thorough, is essentially a literature review of cognitive science organized into a hierarchy—useful but not groundbreaking.
Generated May 28, 2026
Comparison History (13)
Paper 2 addresses a critical, highly timely challenge with broad implications across AI development, cognitive science, and governance: measuring AGI progress. While Paper 1 offers a strong, empirically validated technical contribution to multimodal reasoning, Paper 2 provides a foundational taxonomy and evaluation protocol that could shape future benchmarking standards, policy-making, and interdisciplinary research, giving it a higher potential for widespread, paradigm-shifting scientific impact.
Paper 1 offers a concrete, novel optimization framework for improving LLM agent skills with demonstrated empirical gains on established benchmarks, suggesting near-term applicability and methodological testability. Its gradient-descent-inspired formulation (diagnostic “text gradients,” momentum memory, structured patching) is an actionable contribution likely to be reused in agent/tooling research and practice. Paper 2 is timely and potentially broad but, as described, more conceptual: its impact depends heavily on adoption and on the eventual quality and validity of the proposed tasks and protocols, so its immediate rigor and measurable impact are less certain from the abstract alone.
Paper 2 likely has higher scientific impact because it targets a widely recognized, cross-cutting bottleneck: standardized measurement of progress toward AGI. A cognitive taxonomy plus evaluation protocol is broadly applicable across model classes, benchmarks, and governance contexts, and is timely amid rapid capability gains and policy needs. Its potential real-world applications (evaluation, auditing, safety, regulation, research comparability) span many fields. Paper 1 is technically promising for robotics/physical world modeling, but appears narrower in scope and its impact depends more on empirical rigor and adoption within a specific subcommunity.
Paper 1 addresses a fundamental gap in AI research—the lack of a rigorous framework for measuring progress toward AGI. By proposing a cognitive taxonomy grounded in decades of cognitive science research, it offers a novel interdisciplinary contribution with broad impact across AI, cognitive science, and AI governance/policy. Its potential to standardize AGI evaluation could influence the entire field. Paper 2, while practically useful, presents an incremental improvement (applying offline RL to code generation) with narrower scope and less transformative potential.
Paper 2 addresses a critical and immediate flaw in LLM evaluation: agents relying on intrinsic knowledge rather than actual search. By providing rigorous empirical diagnostics and introducing a dynamic, actionable benchmark (LiveBrowseComp), it offers immediate utility to the highly active field of AI agents. While Paper 1 presents an interesting conceptual framework for AGI, it is primarily theoretical. Paper 2's concrete methodology, dataset, and relevance to current evaluation bottlenecks make it highly likely to see rapid adoption and citations.
Paper 2 addresses a specific, well-defined failure mode in reasoning models with a concrete, testable solution (JTS framework) and demonstrates empirical results. It has immediate practical implications for AI safety in high-risk domains like medical AI. While Paper 1 tackles the important topic of AGI measurement, it is more of a conceptual/taxonomic framework without demonstrated empirical validation. Paper 2's methodological rigor, actionable contributions (reinforcement learning approach, measurable metrics like A@D), and direct relevance to current deployed systems give it higher near-term scientific impact and citability.
Paper 2 presents a concrete, novel technical method (HDMI) with rigorous evaluation, clear improvements over prior methods, and immediate applicability to understanding and controlling LLM internals—a highly active research area. Paper 1 proposes a conceptual framework for measuring AGI progress, which, while timely and important, is more of a position/roadmap paper lacking empirical validation of the framework itself. Paper 2's methodological contribution (probe-free causal interventions with completeness/selectivity metrics) is more likely to be adopted and built upon by the research community.
Paper 1 addresses a fundamental gap in AI research—how to measure progress toward AGI—by proposing a comprehensive cognitive taxonomy and evaluation framework grounded in decades of cognitive science. This has broad, cross-disciplinary impact spanning AI, cognitive science, policy, and governance. Its timeliness is high given the current AGI discourse. Paper 2, while solid applied work combining LLMs and GNNs for fraud detection, is more incremental and domain-specific, with narrower impact limited to fraud detection and graph-based ML methods.
Paper 1 addresses a fundamental, widely-discussed challenge—measuring progress toward AGI—by proposing a comprehensive cognitive taxonomy grounded in decades of cognitive science research. Its breadth of impact spans AI, cognitive science, neuroscience, and AI governance/policy, making it relevant to a very large audience. Paper 2, while methodologically rigorous and practically useful, addresses a narrower problem (artifact drift in agent benchmarks for enterprise tasks). Paper 1's framework has greater potential to shape discourse, standardize evaluation across the field, and influence policy, giving it higher estimated scientific impact.
Paper 2 offers a more immediate and actionable scientific impact by shifting the AI safety paradigm from theoretical alignment to practical runtime controllability. It introduces a concrete benchmark (ControlBench) and architectural framework, which are highly likely to drive experimental follow-up work and citations. While Paper 1 provides a valuable conceptual framework for AGI evaluation, Paper 2 addresses urgent, real-world deployment risks of agentic AI with empirical tools and testable methodologies.
Paper 1 addresses a more fundamental and broadly impactful problem—how to measure progress toward AGI—by proposing a cognitive taxonomy grounded in decades of cognitive science research. This framework has potential to shape how the entire AI field tracks progress, informs governance, and structures evaluation. While Paper 2 (JobBench) makes a valuable contribution with its human-centered benchmark for occupational AI agents, it is more narrowly scoped as one benchmark among many. Paper 1's interdisciplinary foundation and its potential to become a standard framework for AGI evaluation give it greater breadth of impact.
Paper 2 addresses a timely, concrete, and empirically grounded problem—privacy leakage in multi-agent LLM systems—with novel experimental methodology (multi-agent social simulation) and striking quantitative findings (e.g., 8x contagion effect, amplified leakage rates). It directly challenges current safety evaluation paradigms with actionable results relevant to real-world deployments. Paper 1 proposes a conceptual framework for AGI measurement, which is valuable but more speculative, harder to validate, and less immediately actionable. Paper 2's concrete findings and methodological innovation give it broader and more immediate impact across AI safety, policy, and deployment practices.
Paper 2 addresses a fundamental gap in AI research—how to measure progress toward AGI—which has broad implications across AI research, policy, and governance. Its cognitive taxonomy framework could become a widely adopted evaluation standard, impacting multiple fields (AI, cognitive science, policy). Paper 1, while technically sound and practically useful, addresses a narrower optimization problem (retrieval pipeline configuration) with more limited scope. Paper 2's timeliness given current AGI debates and its potential to shape evaluation norms and responsible governance give it substantially broader impact potential.