Process Matters more than Output for Distinguishing Humans from Machines

Milena Rmus, Mathew D. Hardy, Thomas L. Griffiths, Mayank Agrawal

May 7, 2026

arXiv:2605.06524v1

cs.AI (primary)

#99 of 2292 · Artificial Intelligence

Tournament Score: 1543 ± 48 (shown on a 1050–1800 scale)

Win Rate: 92%

Rating: 6.8 / 10

Significance 7.5 · Rigor 6 · Novelty 7 · Clarity 8

Abstract

Reliable human-machine discrimination is becoming increasingly important as large language models and autonomous agents are deployed in online settings. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from those of a human, following the emphasis on outputs as a criterion for intelligence proposed by Alan Turing. Cognitive science offers an alternative perspective: evaluating the process by which behavior is produced. To test whether cognitive processes can reliably distinguish humans from machines, we introduce CogCAPTCHA30, a battery of 30 cognitive tasks designed to elicit diagnostic process-level features even when task performance is matched. Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). To evaluate agentic process differences, we compare off-the-shelf frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro), Centaur (a language model fine-tuned on 10.7M human decisions), and two task-specific fine-tuning approaches applied to Qwen2.5-1.5B-Instruct: action-level supervised fine-tuning (A-SFT) and process-level fine-tuning (P-SFT), which directly optimizes process features. Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks. Explicit process-level supervision can improve human behavioral mimicry, but only if appropriate task-specific process representations are available, highlighting process specification as a bottleneck for achieving human-like cognitive processes in machines.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Process Matters more than Output for Distinguishing Humans from Machines"

1. Core Contribution

The paper makes a conceptually clear and timely argument: as AI systems approach human-level task *performance*, the *process* by which behavior is produced becomes the more reliable discriminative signal. The authors operationalize this by introducing CogCAPTCHA30, a battery of 30 cognitive tasks spanning working memory, decision-making, perception, and planning/reasoning, each paired with process-level behavioral features (129 total) drawn from cognitive science. The key finding is that process-level features yield substantially higher classifier AUC (mean 0.88) than output metrics alone (0.55 on performance-matched tasks) for distinguishing humans from frontier AI agents.
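To make the AUC comparison concrete: AUC can be read as the probability that a classifier scores a randomly chosen human above a randomly chosen agent, so 0.5 is chance and 1.0 is perfect separation. The sketch below computes it by direct pairwise comparison. All numbers are hypothetical toy values, not data from the paper, and `rt_variability` is an assumed stand-in for a process feature rather than one of the paper's 129 features.

```python
def roc_auc(human_scores, agent_scores):
    """Fraction of (human, agent) pairs where the human scores higher; ties count as 0.5."""
    wins = 0.0
    for h in human_scores:
        for a in agent_scores:
            if h > a:
                wins += 1.0
            elif h == a:
                wins += 0.5
    return wins / (len(human_scores) * len(agent_scores))


# Hypothetical toy values (NOT from the paper).
# Output metric: task accuracy, deliberately matched between the two groups.
human_accuracy = [0.80, 0.75, 0.90, 0.85]
agent_accuracy = [0.80, 0.85, 0.75, 0.90]

# Assumed process feature: trial-to-trial response-time variability.
human_rt_variability = [0.42, 0.35, 0.51, 0.47]
agent_rt_variability = [0.05, 0.08, 0.04, 0.40]

output_auc = roc_auc(human_accuracy, agent_accuracy)
process_auc = roc_auc(human_rt_variability, agent_rt_variability)
print(f"output-metric AUC: {output_auc:.4f}")
print(f"process-feature AUC: {process_auc:.4f}")
```

In this toy setup the matched accuracy distributions carry no signal, while the process feature separates the groups almost perfectly. That is the qualitative pattern the assessment describes, though the magnitudes here are illustrative only.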

The secondary contribution is a controlled "red-teaming" study examining what it takes for agents to close the process gap: action-level supervised fine-tuning (A-SFT), process-level fine-tuning (P-SFT), and comparison against Centaur (a 70B model fine-tuned on 10.7M human decisions). The finding that P-SFT improves mimicry on supervised features but fails to transfer across tasks highlights process specification as a bottleneck.

2. Methodological Rigor

Strengths:

The experimental design is well-structured, with both humans (n=97) and agents (n=150 frontier agent runs) completing identical browser-based interfaces, ensuring fair comparison.

The use of multiple frontier models (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro) provides breadth, and the inclusion of Centaur as an intermediate benchmark is informative.

The classifier evaluation uses 5-fold stratified cross-validation with class balancing, and the fine-tuning experiments use 2-fold cross-validation ensuring no participant overlap between training and evaluation.

The P-SFT loss formulation (Equation 1) is clearly specified and the differentiable feature estimation approach is well-documented in the appendix.
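The two cross-validation schemes mentioned above guard against different problems: stratification keeps the human/agent ratio stable in every fold, and participant-grouped splitting keeps one person's data from appearing on both sides of a split. A minimal pure-Python sketch of both follows. The function names are hypothetical, the class-weight formula is one common heuristic, and nothing here is taken from the paper's actual pipeline.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal round-robin within each class
    return folds


def grouped_folds(participant_ids, k=2, seed=0):
    """Assign sample indices to k folds so each participant appears in exactly one fold."""
    rng = random.Random(seed)
    participants = sorted(set(participant_ids))
    rng.shuffle(participants)
    fold_of = {p: i % k for i, p in enumerate(participants)}
    folds = [[] for _ in range(k)]
    for idx, p in enumerate(participant_ids):
        folds[fold_of[p]].append(idx)
    return folds


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), a common balancing heuristic."""
    counts = Counter(labels)
    n, n_classes = len(labels), len(counts)
    return {label: n / (n_classes * count) for label, count in counts.items()}


# Hypothetical toy labels: 10 human samples and 15 agent samples.
labels = ["human"] * 10 + ["agent"] * 15
folds = stratified_folds(labels, k=5)
weights = balanced_class_weights(labels)

# Hypothetical participant IDs with repeated measurements per person.
participant_ids = ["p1", "p1", "p2", "p3", "p3", "p4"]
train_idx, eval_idx = grouped_folds(participant_ids, k=2)
print(folds, weights, train_idx, eval_idx)
```

With a library such as scikit-learn, the same ideas are usually expressed with its stratified and grouped splitters; the hand-rolled version is shown only to make the logic explicit.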

Weaknesses:

The human sample is modest (n=97 from a single Prolific pool), raising questions about demographic and cultural generalizability of "human-like" process.

The fine-tuning experiments are conducted only on a 1.5B parameter model (Qwen2.5-1.5B-Instruct), leaving open whether larger models would show different dynamics. The choice is understandable for computational reasons but limits generalizability claims.

The P-SFT evaluation is restricted to only 3 of 30 tasks (IGT, WCST, Information Sampling) — all structured sequential decision-making tasks with discrete action spaces. This is a narrow slice of the battery.

The cross-task transfer evaluation is somewhat limited: the paper shows P-SFT doesn't transfer across these three tasks, but it's unclear how much structural overlap exists between them. A more systematic analysis of feature-space overlap would strengthen the transfer claims.

Statistical rigor varies: some comparisons use formal tests (Mann-Whitney U), while key claims about relative model performance rely primarily on descriptive metrics (Cohen's d, fool rates) without formal hypothesis testing or confidence intervals.
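As an illustration of the last weakness, a descriptive effect size such as Cohen's d can be paired with an interval estimate at little extra cost. The sketch below uses a percentile bootstrap. It is one possible approach under stated assumptions, not the analysis the authors ran, and the feature values are hypothetical toy numbers.

```python
import random
import statistics


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Cohen's d, resampling each group with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        try:
            estimates.append(cohens_d(resampled_a, resampled_b))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with zero pooled variance
    estimates.sort()
    n = len(estimates)
    lower = estimates[int(alpha / 2 * n)]
    upper = estimates[min(n - 1, int((1 - alpha / 2) * n))]
    return lower, upper


# Hypothetical process-feature values (NOT from the paper).
human_feature = [0.42, 0.35, 0.51, 0.47, 0.39, 0.44, 0.50, 0.38]
agent_feature = [0.05, 0.08, 0.04, 0.40, 0.07, 0.06, 0.09, 0.05]

d = cohens_d(human_feature, agent_feature)
lower, upper = bootstrap_ci(human_feature, agent_feature)
print(f"Cohen's d = {d:.2f}, 95% bootstrap CI = [{lower:.2f}, {upper:.2f}]")
```

Reporting the interval alongside d lets a reader judge whether an apparent gap between two models could plausibly be sampling noise, which point estimates alone do not show.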

3. Potential Impact

The paper addresses a genuinely important practical problem — human-machine discrimination in online settings — and offers a principled alternative to output-based CAPTCHAs. The cognitive science framing is compelling and could influence:

Security and verification systems: Process-based authentication could complement or replace traditional CAPTCHAs, especially as vision-language models defeat existing challenges.

AI behavioral alignment: The finding that process mimicry requires task-specific process representations has implications for how we think about training human-like AI agents.

Cognitive science benchmarking: CogCAPTCHA30 could become a useful benchmark for evaluating how closely AI agents replicate human cognitive processes, complementing existing cognitive benchmarks.

However, practical deployment faces challenges: the short-form constraint (≤10 trials, <1 minute) keeps the tasks deployable but may limit the richness of extractable process signatures. Additionally, as the authors acknowledge, adversarial adaptation could erode discriminative power over time.

4. Timeliness & Relevance

This paper is highly timely. The rapid improvement of frontier models on traditional benchmarks and CAPTCHAs creates an urgent need for more robust human-verification approaches. The paper cites GPT-4.5 being judged human 73% of the time in Turing Tests and frontier vision models solving reCAPTCHAv2. The inclusion of GPT-5 (a very recent model) demonstrates currency. The adversarial framing, which acknowledges that any discriminator may become a target for optimization, reflects mature thinking about the arms-race dynamics in this space.

5. Strengths & Limitations

Key Strengths:

Conceptual clarity: The output-vs-process distinction is well-motivated from both AI and cognitive science perspectives, with clear connections to philosophical literature (Block, Turing).

Comprehensive task battery: 30 tasks across 5 cognitive domains with 129 process features represents substantial engineering and domain expertise.

Practical constraints: The short-form task design acknowledges deployment realities.

Honest about limitations: The paper transparently reports where P-SFT fails (cross-task transfer) and discusses the adversarial nature of the problem.

Notable Weaknesses:

Narrow fine-tuning scope: Only 3/30 tasks are used for the fine-tuning experiments, and all are discrete sequential decision tasks. Claims about "process specification as a bottleneck" rest on limited evidence.

Feature selection: The 129 process features are hand-designed based on cognitive science literature. There's no analysis of feature redundancy, sensitivity, or which features drive discrimination.

Missing adversarial depth: While framed as red-teaming, the adversarial evaluation is relatively mild. No attempt is made to use the classifier's decision boundary to guide more sophisticated attacks (e.g., reinforcement learning against the detector).

Scalability questions: Whether this approach scales to open-ended interactions (chat, browsing) beyond structured cognitive tasks is unaddressed.

Industry affiliation: Three of four authors are from Roundtable Technologies, which may have commercial interests in human-verification technology.

Overall Assessment

This is a well-conceived paper that makes a clear and timely contribution at the intersection of cognitive science and AI safety/security. The core insight — that process-level features discriminate humans from machines more reliably than outputs — is supported by extensive empirical evidence across the 30-task battery. The fine-tuning analysis, while limited in scope, provides useful insights about the difficulty of process mimicry. The work would benefit from broader fine-tuning experiments, more formal statistical analysis, and deeper adversarial evaluation, but the contribution is solid and addresses a pressing need.

Rating: 6.8 / 10

Significance 7.5 · Rigor 6 · Novelty 7 · Clarity 8

Generated May 8, 2026

Comparison History (24)

vs. AI scientists produce results without reasoning scientifically

gpt-5.2 · 5/16/2026

Paper 1 has higher likely scientific impact due to broader relevance and timeliness: it directly interrogates the reliability of autonomous “AI scientist” systems across eight scientific domains with large-scale empirical evaluation (25,000+ runs) and identifies a fundamental limitation—lack of epistemically normative reasoning—that affects trustworthiness of AI-generated science. Its claims (scaffolds contribute little; outcome-only evaluation fails; reasoning must be a training target) can reshape evaluation practices and model-training priorities across AI-for-science, agent design, and research governance. Paper 2 is valuable but narrower (human–machine discrimination via cognitive tasks) and more application-specific.

vs. FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis

gpt-5.2 · 5/16/2026

Paper 1 has higher potential scientific impact due to its novel shift from output-based to process-based human–machine discrimination, introducing a general-purpose cognitive task battery (CogCAPTCHA30) and demonstrating strong separability even under output matching. Its implications span AI evaluation, cognitive science, security/anti-bot systems, and alignment/behavioral mimicry, with timely relevance as frontier agents proliferate. Methodologically, it compares multiple model classes and fine-tuning regimes and identifies a key limitation (process representation transfer), offering broadly applicable insights beyond a single application domain.

vs. M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

claude-opus-4.6 · 5/16/2026

Paper 2 introduces a fundamentally new paradigm for human-machine discrimination based on cognitive processes rather than outputs, with broad implications across AI safety, cognitive science, security (CAPTCHAs), and the philosophy of intelligence. Its interdisciplinary nature, timeliness given rapid LLM/agent deployment, and the novel CogCAPTCHA30 benchmark give it wider impact potential. Paper 1, while technically strong with practical results on SWE-Bench, addresses a narrower model merging problem with more incremental contributions to the existing reasoning enhancement literature.

vs. BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

claude-opus-4.6 · 5/16/2026

Paper 2 introduces a fundamentally novel paradigm shift—evaluating cognitive processes rather than outputs for human-machine discrimination. This has broad implications across AI safety, cognitive science, authentication systems, and AI alignment. The CogCAPTCHA framework addresses the timely and critical challenge of distinguishing humans from increasingly capable AI agents. Paper 1, while valuable as an engineering benchmark for CAD code generation, serves a narrower community. Paper 2's cross-disciplinary relevance (cognitive science, security, ML) and conceptual innovation give it higher potential for widespread scientific impact.

vs. Engagement Process: Rethinking the Temporal Interface of Action and Observation

claude-opus-4.6 · 5/16/2026

Paper 2 addresses the timely and broadly impactful problem of human-machine discrimination as AI agents become increasingly capable. CogCAPTCHA30 offers a practical, novel framework grounded in cognitive science that shifts evaluation from outputs to processes—relevant to AI safety, security, and cognitive science. It benchmarks frontier models (GPT-5, Claude, Gemini) and introduces actionable fine-tuning methods. Paper 1 proposes a useful but more niche formalism (Engagement Process) for temporal interaction modeling that extends POMDPs. While technically interesting, its impact is narrower, primarily within RL/agent design communities.

vs. Abductive Reasoning with Probabilistic Commonsense

claude-opus-4.6 · 5/16/2026

Paper 2 introduces a fundamentally new paradigm for human-machine discrimination based on cognitive processes rather than outputs, which has broad implications across AI safety, cognitive science, and security. CogCAPTCHA30 addresses the increasingly urgent problem of distinguishing humans from LLMs/agents, with clear real-world applications (CAPTCHAs, online authentication, AI safety). It evaluates frontier models (GPT-5, Claude Sonnet 4.5) and proposes novel fine-tuning approaches. Its interdisciplinary nature spanning cognitive science and AI, combined with timeliness as AI capabilities advance, gives it broader impact potential than Paper 1's more incremental improvement to neurosymbolic reasoning.

vs. Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

gpt-5.2 · 5/16/2026

Paper 1 has higher likely scientific impact due to a more novel cross-disciplinary contribution: jointly benchmarking frontier LRMs on real human learning behavior, game performance, and fMRI brain-prediction in complex, naturalistic tasks, with strong quantitative gains and mechanistic ablations (state representation vs planning). This tightly connects AI evaluation with cognitive neuroscience and may influence both model development and theories of human learning. Paper 2 is timely and useful for human–machine discrimination and evaluation, but its impact is more application/policy-facing and may be narrower scientifically, with process features potentially task- and dataset-specific.

vs. Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

gpt-5.2 · 5/16/2026

Paper 1 has higher likely scientific impact due to stronger novelty (process-level human–machine discrimination via a cognitively grounded task battery), broader cross-field relevance (AI evaluation, cognitive science, security/anti-bot, human-computer interaction), and clear real-world application need as LLM agents proliferate. The introduction of CogCAPTCHA30 and empirical evidence that process features outperform outputs (AUC 0.88) provide a durable evaluation framework and new research direction (process specification bottleneck). Paper 2 is timely and useful for RLVR training, but is a more incremental algorithmic improvement with narrower scope.

vs. MMSkills: Towards Multimodal Skills for General Visual Agents

claude-opus-4.6 · 5/16/2026

Paper 1 introduces a fundamentally novel paradigm for human-machine discrimination by shifting from output-based to process-based evaluation, grounded in cognitive science. CogCAPTCHA30 addresses the increasingly urgent problem of distinguishing humans from AI agents as LLMs improve, with broad implications for security, AI safety, and cognitive science. The rigorous methodology (AUC=0.88, multiple model comparisons, fine-tuning approaches) and the conceptual insight that process specification is a bottleneck for human-like AI are highly impactful. Paper 2, while solid engineering, is more incremental—extending skill frameworks to be multimodal for visual agents, with narrower scope and less conceptual novelty.

vs. GenCircuit-RL: Reinforcement Learning from Hierarchical Verification for Genetic Circuit Design

claude-opus-4.6 · 5/16/2026

Paper 1 addresses a fundamental and timely problem—distinguishing humans from AI—with broad implications across security, AI safety, cognitive science, and online trust. Its novel cognitive process-based approach (CogCAPTCHA30) offers a paradigm shift from output-based evaluation, with wide applicability as LLMs become ubiquitous. Paper 2 is strong applied work in synthetic biology but addresses a narrower domain. Paper 1's interdisciplinary breadth, timeliness given rapid AI deployment, and conceptual novelty (process vs. output distinction) give it higher potential for broad scientific impact.

vs. OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control

gemini-3.1 · 5/16/2026

Paper 2 introduces a fundamental paradigm shift by moving from output-based to process-based evaluation for human-machine discrimination. This addresses a critical, timely issue in AI safety and security with broad implications across cognitive science, AI evaluation, and bot detection. While Paper 1 offers a valuable methodological improvement for traffic signal control, Paper 2 has significantly broader cross-disciplinary applicability, a more novel conceptual framework, and evaluates frontier models on a systemic level, indicating a higher potential for widespread scientific impact.

vs. From History to State: Constant-Context Skill Learning for LLM Agents

gemini-3.1 · 5/16/2026

Paper 2 introduces a paradigm shift in human-machine discrimination by focusing on cognitive processes rather than outputs, addressing a critical vulnerability in the era of advanced LLMs. Its interdisciplinary approach, novel benchmark (CogCAPTCHA30), and broad implications for AI safety, bot detection, and cognitive modeling offer wider and more fundamental scientific impact compared to Paper 1's methodological improvements in agent context efficiency.

vs. Intermediate Layers Encode Optimal Biological Representations in Single-Cell Foundation Models

gemini-3.1 · 5/16/2026

Paper 2 addresses a critical, highly timely challenge with broad societal and interdisciplinary implications: distinguishing humans from AI agents. By shifting the paradigm from output-based evaluation (Turing test) to process-level cognitive features, it offers a novel approach with wide applications in AI safety, cybersecurity, and cognitive science. While Paper 1 provides valuable insights for single-cell genomics and foundation model feature extraction, its impact is relatively confined to computational biology, whereas Paper 2's findings fundamentally impact how we evaluate and interact with increasingly autonomous AI systems across all domains.

vs. Hidden in Plain Sight: Visual-to-Symbolic Analytical Solution Inference from Field Visualizations

gemini-3.1 · 5/16/2026

Paper 1 addresses a critical, timely problem in AI safety and bot detection by fundamentally shifting the evaluation paradigm from output-matching to process-tracking. Its implications span cognitive science, AI evaluation, and web security, offering broader societal and interdisciplinary scientific impact than Paper 2, which focuses on a more specialized domain of AI-assisted physics reasoning.

vs. From Coordinate Matching to Structural Alignment: Rethinking Prototype Alignment in Heterogeneous Federated Learning

claude-opus-4.6 · 5/8/2026

Paper 1 introduces a fundamentally novel framework (CogCAPTCHA30) for human-machine discrimination based on cognitive processes rather than outputs, addressing a critical and timely problem as LLMs become more capable. It bridges cognitive science and AI safety, has broad implications for security (CAPTCHAs, bot detection), AI alignment, and understanding machine cognition. The work evaluates multiple frontier models and fine-tuning strategies, offering deep methodological rigor. Paper 2, while solid, addresses a more incremental improvement in heterogeneous federated learning with narrower scope and more limited cross-field impact.

vs. TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning

claude-opus-4.6 · 5/8/2026

Paper 1 introduces a novel paradigm shift in human-machine discrimination by focusing on cognitive processes rather than outputs, with broad implications across AI safety, cognitive science, and online security. The CogCAPTCHA30 benchmark and the systematic comparison of fine-tuning approaches offer deep methodological contributions. Paper 2, while practically useful, represents an incremental engineering contribution (iterative refinement pipelines are well-established) in a narrower domain. Paper 1's conceptual framework—process-based evaluation—has wider cross-disciplinary impact and addresses a fundamental, timely challenge as AI capabilities advance.

vs. When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

claude-opus-4.6 · 5/8/2026

Paper 2 introduces a fundamentally novel paradigm shift—evaluating cognitive processes rather than outputs for human-machine discrimination. CogCAPTCHA30 has broad implications across AI safety, security (CAPTCHA systems), cognitive science, and the philosophical question of machine intelligence. It engages with frontier models (GPT-5, Claude 4.5, Gemini 2.5 Pro) and proposes a framework applicable as LLMs become more capable. Paper 1, while practically useful, addresses a narrower optimization problem (cost-effective bandits in finance) with more incremental contributions. Paper 2's cross-disciplinary impact and timeliness give it higher potential.

vs. When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

gemini-3.1 · 5/8/2026

Paper 1 proposes a foundational paradigm shift in AI evaluation, moving from output-based criteria to process-based cognitive metrics. By introducing CogCAPTCHA30, it addresses an urgent, cross-disciplinary challenge: distinguishing humans from increasingly sophisticated AI agents. This has immense real-world applications in cybersecurity, AI safety, and cognitive science. While Paper 2 offers a rigorous framework for cost-effective deployment of LLMs in bandit problems, its impact is narrower, primarily optimizing financial and recommendation systems. Paper 1's broader implications for fundamental human-machine differentiation and AI alignment give it a significantly higher potential scientific impact.

vs. Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios

gemini-3.1 · 5/8/2026

Paper 2 offers a paradigm shift from output-based to process-based evaluation for human-machine discrimination, fundamentally challenging traditional Turing Test approaches. Its interdisciplinary bridge between cognitive science, AI safety, and cybersecurity gives it broader scientific impact compared to Paper 1, which, while highly innovative in mitigating long-video context limits, represents a more domain-specific methodological advancement in computer vision and RAG frameworks. The introduction of CogCAPTCHA30 provides a highly timely and widely applicable benchmark for the pressing issue of bot detection.

vs. BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

claude-opus-4.6 · 5/8/2026

Paper 1 introduces a fundamentally novel framework for human-machine discrimination based on cognitive processes rather than outputs, which addresses a timely and increasingly critical problem as LLMs become more capable. The CogCAPTCHA30 battery and the theoretical insight that process-level features outperform output-matching metrics (AUC=0.88) represent a paradigm shift with broad implications for AI safety, authentication, and cognitive science. Paper 2, while practically useful as an engineering toolkit for biomedical agent evaluation, is more incremental—primarily a systems contribution that standardizes existing evaluation practices rather than introducing new scientific insights.