Process Matters more than Output for Distinguishing Humans from Machines
Milena Rmus, Mathew D. Hardy, Thomas L. Griffiths, Mayank Agrawal
Abstract
Reliable human-machine discrimination is becoming increasingly important as large language models and autonomous agents are deployed in online settings. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from those of a human, following the emphasis on outputs as a criterion for intelligence proposed by Alan Turing. Cognitive science offers an alternative perspective: evaluating the process by which behavior is produced. To test whether cognitive processes can reliably distinguish humans from machines, we introduce CogCAPTCHA30, a battery of 30 cognitive tasks designed to elicit diagnostic process-level features even when task performance is matched. Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). To evaluate agentic process differences, we compare off-the-shelf frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro), Centaur (a language model fine-tuned on 10.7M human decisions), and two task-specific fine-tuning approaches applied to Qwen2.5-1.5B-Instruct: action-level supervised fine-tuning (A-SFT) and process-level fine-tuning (P-SFT), which directly optimizes process features. Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks. Explicit process-level supervision can improve human behavioral mimicry, but only if appropriate task-specific process representations are available, highlighting process specification as a bottleneck for achieving human-like cognitive processes in machines.
AI Impact Assessments
Scientific Impact Assessment: "Process Matters more than Output for Distinguishing Humans from Machines"
1. Core Contribution
The paper makes a conceptually clear and timely argument: as AI systems approach human-level task *performance*, the *process* by which behavior is produced becomes the more reliable discriminative signal. The authors operationalize this by introducing CogCAPTCHA30, a battery of 30 cognitive tasks spanning working memory, decision-making, perception, and planning/reasoning, each paired with process-level behavioral features (129 total) drawn from cognitive science. The key finding is that process-level features yield substantially higher classifier AUC (mean 0.88) than output metrics alone (0.55 on performance-matched tasks) for distinguishing humans from frontier AI agents.
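To make the AUC comparison concrete, here is a minimal sketch of how a performance-matched output metric can sit near chance while a process feature still separates the two groups. Everything in it is an assumption for illustration: the data are synthetic, the names `human_acc`, `agent_acc`, `human_rt_var`, and `agent_rt_var` are hypothetical, and the single-feature rank-based AUC stands in for whatever classifier the paper actually uses.

```python
import random


def auc(scores_pos, scores_neg):
    """Rank-based AUC: probability that a random positive example
    scores higher than a random negative one (ties count as half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


rng = random.Random(0)
n = 500

# Hypothetical synthetic data, NOT taken from the paper.
# Output metric: accuracy drawn from the same distribution for both
# groups, mimicking performance matching.
human_acc = [rng.gauss(0.80, 0.05) for _ in range(n)]
agent_acc = [rng.gauss(0.80, 0.05) for _ in range(n)]

# Process feature: an invented "response-time variability" score whose
# distribution differs between humans and agents.
human_rt_var = [rng.gauss(1.0, 0.5) for _ in range(n)]
agent_rt_var = [rng.gauss(0.0, 0.5) for _ in range(n)]

output_auc = auc(human_acc, agent_acc)
process_auc = auc(human_rt_var, agent_rt_var)
print(f"output AUC = {output_auc:.2f}, process AUC = {process_auc:.2f}")
```

In this toy setup the output AUC stays close to 0.5 because the accuracy distributions are identical by construction, while the process AUC is high only because a gap was built into the synthetic feature. The sketch illustrates the logic of the comparison, not the paper's reported numbers.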
The secondary contribution is a controlled "red-teaming" study examining what it takes for agents to close the process gap: action-level supervised fine-tuning (A-SFT), process-level fine-tuning (P-SFT), and comparison against Centaur (a 70B model fine-tuned on 10.7M human decisions). The finding that P-SFT improves mimicry on supervised features but fails to transfer across tasks highlights process specification as a bottleneck.
2. Methodological Rigor
Strengths:
Weaknesses:
3. Potential Impact
The paper addresses a genuinely important practical problem — human-machine discrimination in online settings — and offers a principled alternative to output-based CAPTCHAs. The cognitive science framing is compelling and could influence:
However, deployment faces challenges: the short-form constraint (≤10 trials, <1 minute) keeps the test usable but may limit the richness of extractable process signatures. Additionally, as the authors acknowledge, adversarial adaptation could erode discriminative power over time.
4. Timeliness & Relevance
This paper is highly timely. The rapid improvement of frontier models on traditional benchmarks and CAPTCHAs creates an urgent need for more robust human-verification approaches. The paper cites GPT-4.5 being judged human 73% of the time in Turing Tests and frontier vision models solving reCAPTCHAv2. The inclusion of GPT-5 (a very recent model) demonstrates currency. The adversarial framing, which acknowledges that any discriminator may become a target for optimization, reflects mature thinking about the arms-race dynamics in this space.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
This is a well-conceived paper that makes a clear and timely contribution at the intersection of cognitive science and AI safety/security. The core insight — that process-level features discriminate humans from machines more reliably than outputs — is supported by extensive empirical evidence across the 30-task battery. The fine-tuning analysis, while limited in scope, provides useful insights about the difficulty of process mimicry. The work would benefit from broader fine-tuning experiments, more formal statistical analysis, and deeper adversarial evaluation, but the contribution is solid and addresses a pressing need.
Generated May 8, 2026
Comparison History (24)
Paper 1 has higher likely scientific impact due to broader relevance and timeliness: it directly interrogates the reliability of autonomous “AI scientist” systems across eight scientific domains with large-scale empirical evaluation (25,000+ runs) and identifies a fundamental limitation—lack of epistemically normative reasoning—that affects trustworthiness of AI-generated science. Its claims (scaffolds contribute little; outcome-only evaluation fails; reasoning must be a training target) can reshape evaluation practices and model-training priorities across AI-for-science, agent design, and research governance. Paper 2 is valuable but narrower (human–machine discrimination via cognitive tasks) and more application-specific.
Paper 1 has higher potential scientific impact due to its novel shift from output-based to process-based human–machine discrimination, introducing a general-purpose cognitive task battery (CogCAPTCHA30) and demonstrating strong separability even under output matching. Its implications span AI evaluation, cognitive science, security/anti-bot systems, and alignment/behavioral mimicry, with timely relevance as frontier agents proliferate. Methodologically, it compares multiple model classes and fine-tuning regimes and identifies a key limitation (process representation transfer), offering broadly applicable insights beyond a single application domain.
Paper 2 introduces a fundamentally new paradigm for human-machine discrimination based on cognitive processes rather than outputs, with broad implications across AI safety, cognitive science, security (CAPTCHAs), and the philosophy of intelligence. Its interdisciplinary nature, timeliness given rapid LLM/agent deployment, and the novel CogCAPTCHA30 benchmark give it wider impact potential. Paper 1, while technically strong with practical results on SWE-Bench, addresses a narrower model merging problem with more incremental contributions to the existing reasoning enhancement literature.
Paper 2 introduces a fundamentally novel paradigm shift—evaluating cognitive processes rather than outputs for human-machine discrimination. This has broad implications across AI safety, cognitive science, authentication systems, and AI alignment. The CogCAPTCHA framework addresses the timely and critical challenge of distinguishing humans from increasingly capable AI agents. Paper 1, while valuable as an engineering benchmark for CAD code generation, serves a narrower community. Paper 2's cross-disciplinary relevance (cognitive science, security, ML) and conceptual innovation give it higher potential for widespread scientific impact.
Paper 2 addresses the timely and broadly impactful problem of human-machine discrimination as AI agents become increasingly capable. CogCAPTCHA30 offers a practical, novel framework grounded in cognitive science that shifts evaluation from outputs to processes—relevant to AI safety, security, and cognitive science. It benchmarks frontier models (GPT-5, Claude, Gemini) and introduces actionable fine-tuning methods. Paper 1 proposes a useful but more niche formalism (Engagement Process) for temporal interaction modeling that extends POMDPs. While technically interesting, its impact is narrower, primarily within RL/agent design communities.
Paper 2 introduces a fundamentally new paradigm for human-machine discrimination based on cognitive processes rather than outputs, which has broad implications across AI safety, cognitive science, and security. CogCAPTCHA30 addresses the increasingly urgent problem of distinguishing humans from LLMs/agents, with clear real-world applications (CAPTCHAs, online authentication, AI safety). It evaluates frontier models (GPT-5, Claude Sonnet 4.5) and proposes novel fine-tuning approaches. Its interdisciplinary nature spanning cognitive science and AI, combined with timeliness as AI capabilities advance, gives it broader impact potential than Paper 1's more incremental improvement to neurosymbolic reasoning.
Paper 1 has higher likely scientific impact due to a more novel cross-disciplinary contribution: jointly benchmarking frontier LRMs on real human learning behavior, game performance, and fMRI brain-prediction in complex, naturalistic tasks, with strong quantitative gains and mechanistic ablations (state representation vs planning). This tightly connects AI evaluation with cognitive neuroscience and may influence both model development and theories of human learning. Paper 2 is timely and useful for human–machine discrimination and evaluation, but its impact is more application/policy-facing and may be narrower scientifically, with process features potentially task- and dataset-specific.
Paper 1 has higher likely scientific impact due to stronger novelty (process-level human–machine discrimination via a cognitively grounded task battery), broader cross-field relevance (AI evaluation, cognitive science, security/anti-bot, human-computer interaction), and clear real-world application need as LLM agents proliferate. The introduction of CogCAPTCHA30 and empirical evidence that process features outperform outputs (AUC 0.88) provide a durable evaluation framework and new research direction (process specification bottleneck). Paper 2 is timely and useful for RLVR training, but is a more incremental algorithmic improvement with narrower scope.
Paper 1 introduces a fundamentally novel paradigm for human-machine discrimination by shifting from output-based to process-based evaluation, grounded in cognitive science. CogCAPTCHA30 addresses the increasingly urgent problem of distinguishing humans from AI agents as LLMs improve, with broad implications for security, AI safety, and cognitive science. The rigorous methodology (AUC=0.88, multiple model comparisons, fine-tuning approaches) and the conceptual insight that process specification is a bottleneck for human-like AI are highly impactful. Paper 2, while solid engineering, is more incremental—extending skill frameworks to be multimodal for visual agents, with narrower scope and less conceptual novelty.
Paper 1 addresses a fundamental and timely problem—distinguishing humans from AI—with broad implications across security, AI safety, cognitive science, and online trust. Its novel cognitive process-based approach (CogCAPTCHA30) offers a paradigm shift from output-based evaluation, with wide applicability as LLMs become ubiquitous. Paper 2 is strong applied work in synthetic biology but addresses a narrower domain. Paper 1's interdisciplinary breadth, timeliness given rapid AI deployment, and conceptual novelty (process vs. output distinction) give it higher potential for broad scientific impact.
Paper 2 introduces a fundamental paradigm shift by moving from output-based to process-based evaluation for human-machine discrimination. This addresses a critical, timely issue in AI safety and security with broad implications across cognitive science, AI evaluation, and bot detection. While Paper 1 offers a valuable methodological improvement for traffic signal control, Paper 2 has significantly broader cross-disciplinary applicability, a more novel conceptual framework, and evaluates frontier models on a systemic level, indicating a higher potential for widespread scientific impact.
Paper 2 introduces a paradigm shift in human-machine discrimination by focusing on cognitive processes rather than outputs, addressing a critical vulnerability in the era of advanced LLMs. Its interdisciplinary approach, novel benchmark (CogCAPTCHA30), and broad implications for AI safety, bot detection, and cognitive modeling offer wider and more fundamental scientific impact compared to Paper 1's methodological improvements in agent context efficiency.
Paper 2 addresses a critical, highly timely challenge with broad societal and interdisciplinary implications: distinguishing humans from AI agents. By shifting the paradigm from output-based evaluation (Turing test) to process-level cognitive features, it offers a novel approach with wide applications in AI safety, cybersecurity, and cognitive science. While Paper 1 provides valuable insights for single-cell genomics and foundation model feature extraction, its impact is relatively confined to computational biology, whereas Paper 2's findings fundamentally impact how we evaluate and interact with increasingly autonomous AI systems across all domains.
Paper 1 addresses a critical, timely problem in AI safety and bot detection by fundamentally shifting the evaluation paradigm from output-matching to process-tracking. Its implications span cognitive science, AI evaluation, and web security, offering broader societal and interdisciplinary scientific impact than Paper 2, which focuses on a more specialized domain of AI-assisted physics reasoning.
Paper 1 introduces a fundamentally novel framework (CogCAPTCHA30) for human-machine discrimination based on cognitive processes rather than outputs, addressing a critical and timely problem as LLMs become more capable. It bridges cognitive science and AI safety, has broad implications for security (CAPTCHAs, bot detection), AI alignment, and understanding machine cognition. The work evaluates multiple frontier models and fine-tuning strategies, offering deep methodological rigor. Paper 2, while solid, addresses a more incremental improvement in heterogeneous federated learning with narrower scope and more limited cross-field impact.
Paper 1 introduces a novel paradigm shift in human-machine discrimination by focusing on cognitive processes rather than outputs, with broad implications across AI safety, cognitive science, and online security. The CogCAPTCHA30 benchmark and the systematic comparison of fine-tuning approaches offer deep methodological contributions. Paper 2, while practically useful, represents an incremental engineering contribution (iterative refinement pipelines are well-established) in a narrower domain. Paper 1's conceptual framework—process-based evaluation—has wider cross-disciplinary impact and addresses a fundamental, timely challenge as AI capabilities advance.
Paper 2 introduces a fundamentally novel paradigm shift: evaluating cognitive processes rather than outputs for human-machine discrimination. CogCAPTCHA30 has broad implications across AI safety, security (CAPTCHA systems), cognitive science, and the philosophical question of machine intelligence. It engages with frontier models (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro) and proposes a framework applicable as LLMs become more capable. Paper 1, while practically useful, addresses a narrower optimization problem (cost-effective bandits in finance) with more incremental contributions. Paper 2's cross-disciplinary impact and timeliness give it higher potential.
Paper 1 proposes a foundational paradigm shift in AI evaluation, moving from output-based criteria to process-based cognitive metrics. By introducing CogCAPTCHA30, it addresses an urgent, cross-disciplinary challenge: distinguishing humans from increasingly sophisticated AI agents. This has immense real-world applications in cybersecurity, AI safety, and cognitive science. While Paper 2 offers a rigorous framework for cost-effective deployment of LLMs in bandit problems, its impact is narrower, primarily optimizing financial and recommendation systems. Paper 1's broader implications for fundamental human-machine differentiation and AI alignment give it a significantly higher potential scientific impact.
Paper 2 offers a paradigm shift from output-based to process-based evaluation for human-machine discrimination, fundamentally challenging traditional Turing Test approaches. Its interdisciplinary bridge between cognitive science, AI safety, and cybersecurity gives it broader scientific impact compared to Paper 1, which, while highly innovative in mitigating long-video context limits, represents a more domain-specific methodological advancement in computer vision and RAG frameworks. The introduction of CogCAPTCHA30 provides a highly timely and widely applicable benchmark for the pressing issue of bot detection.
Paper 1 introduces a fundamentally novel framework for human-machine discrimination based on cognitive processes rather than outputs, which addresses a timely and increasingly critical problem as LLMs become more capable. The CogCAPTCHA30 battery and the theoretical insight that process-level features outperform output-matching metrics (AUC=0.88) represent a paradigm shift with broad implications for AI safety, authentication, and cognitive science. Paper 2, while practically useful as an engineering toolkit for biomedical agent evaluation, is more incremental—primarily a systems contribution that standardizes existing evaluation practices rather than introducing new scientific insights.