FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

Leonardo Bertolazzi, Katya Tentori, Raffaella Bernardi

Jun 3, 2026

arXiv:2606.04751v1

cs.AI (primary)

#1497 of 3404 · Artificial Intelligence

Tournament Score: 1417±46 (scale 1050 to 1800)

Win Rate: 56%

Rating: 6.8/10

Significance 6.5 · Rigor 7 · Novelty 6.5 · Clarity 8

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively engage in forms of inductive reasoning relevant to scientific discovery remains an open question. In this work, we introduce FALSIFYBENCH, an evaluation framework for hypothesis-driven reasoning inspired by the classic Wason 2-4-6 task, in which agents must discover hidden semantic properties by iteratively proposing examples and receiving feedback. This task captures key elements of scientific reasoning: hypothesis generation, evidence gathering, and belief revision in response to both confirming and disconfirming evidence. Our evaluation of 12 LLMs across model families and scales shows that reasoning models are generally stronger scientific reasoners than instruction-tuned models, although no model comes close to optimal performance. The primary driver of success is the capacity for negative testing: models that actively seek to falsify their hypotheses consistently outperform those that primarily seek confirmation. Moreover, a fine-grained turn-level analysis, neglected in previous work, reveals that failure is tied to identifiable patterns in how models navigate the hypothesis space.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: FALSIFYBENCH

1. Core Contribution

FALSIFYBENCH introduces an evaluation framework for hypothesis-driven inductive reasoning in LLMs, adapting the classic Wason 2-4-6 task from cognitive psychology into a semantic domain. The key innovation is threefold: (a) replacing numerical rules with semantic taxonomies drawn from WordNet, which mitigates contamination concerns and enables richer hypothesis spaces; (b) grounding the evaluation in Klayman and Ha's (1987) normative framework, which distinguishes when positive vs. negative testing is the appropriate strategy based on the set-theoretic relationship between hypothesis H and target rule R; and (c) providing fine-grained turn-level analysis of how models navigate hypothesis space, going beyond aggregate success metrics.

The benchmark operationalizes a specific but important aspect of scientific reasoning: the ability to falsify overly narrow hypotheses through negative testing. The design ensures H⊂R configurations dominate (as in Wason's original setup), making negative testing the uniquely informative strategy.
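The asymmetry between positive and negative tests under H⊂R can be made concrete with a small sketch. The sets and words below are hypothetical toy values, not items from the benchmark, and the oracle is assumed to answer "yes" exactly for members of R.

```python
# Toy illustration (hypothetical sets, not benchmark data) of why negative
# testing is needed when the hypothesis H is a strict subset of the rule R.

R = {"sparrow", "eagle", "salmon", "trout", "beagle"}  # hidden rule, e.g. "animal"
H = {"sparrow", "eagle"}                                # player's hypothesis, e.g. "bird"
UNIVERSE = R | {"hammer", "violin"}                     # everything the player may propose


def oracle(item: str) -> bool:
    """Return True when the item satisfies the hidden rule R."""
    return item in R


def falsifies(item: str) -> bool:
    """A test falsifies H when the oracle's answer differs from H's prediction."""
    return oracle(item) != (item in H)


positive_tests = sorted(H)             # items H predicts to be "yes"
negative_tests = sorted(UNIVERSE - H)  # items H predicts to be "no"

falsifying_positive = [x for x in positive_tests if falsifies(x)]
falsifying_negative = [x for x in negative_tests if falsifies(x)]

print(falsifying_positive)  # no positive test can contradict H when H is inside R
print(falsifying_negative)  # items in R but outside H reveal that H is too narrow
```

Because every member of H is also a member of R, a player who only proposes items that fit the current hypothesis receives nothing but "yes" answers and never learns that the hypothesis is too narrow.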

2. Methodological Rigor

Strengths in design: The dual-role architecture (each LLM serves as both player and oracle) is clever, enabling self-contained evaluation without human-in-the-loop costs. The authors appropriately control for oracle quality through human annotation and a Bayesian mixed-effects logistic regression, demonstrating that confirmation bias rather than oracle error drives failure. This is a methodologically careful disentanglement.

Annotation and analysis: The turn-level annotation of H-R relations using GPT-5-Mini as an offline annotator, combined with the surface-level linguistic feature classifier via regex patterns, provides interpretable diagnostic information. The statistical analyses (Spearman correlations, Mann-Whitney U tests, Fisher's exact tests, Bayesian regression) are appropriate for the data structure.
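A regex-based detector for surface-level linguistic hypotheses might look roughly like the sketch below. The patterns and the function name `is_surface_level` are assumptions made for illustration; the paper's actual pattern set is not reproduced here.

```python
import re

# Hypothetical patterns for flagging hypotheses about spelling or sound
# rather than meaning. Illustrative only; not the authors' pattern list.
SURFACE_PATTERNS = [
    re.compile(r"\bcontains? the letter\b", re.IGNORECASE),
    re.compile(r"\b(starts?|begins?|ends?) with\b", re.IGNORECASE),
    re.compile(r"\b(vowels?|consonants?|syllables?)\b", re.IGNORECASE),
    re.compile(r"\bnumber of letters\b", re.IGNORECASE),
]


def is_surface_level(hypothesis: str) -> bool:
    """Return True when any orthographic or phonological pattern matches."""
    return any(p.search(hypothesis) for p in SURFACE_PATTERNS)


examples = [
    "contains the letter 'o'",
    "word ends with a consonant",
    "is a kind of bird",
]
flags = [is_surface_level(h) for h in examples]
print(list(zip(examples, flags)))
```

A keyword approach like this is cheap and interpretable, but it can miss paraphrases and can misfire on semantic hypotheses that happen to use the same words, so it is best read as a diagnostic signal rather than a ground-truth label.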

Weaknesses: The curated set of 100 games is relatively small, and only 5 of 7 initial target categories survived filtering, raising questions about coverage. The oracle error analysis relies on sampling just one Test and one Guess turn per game for human annotation; the authors acknowledge this limitation but argue it is conservative. Using GPT-5-Mini as the H-R relation annotator introduces potential systematic biases that are not validated against human judgments. Leaving temperature and sampling parameters at provider defaults introduces uncontrolled variability across models. Finally, the 20-turn limit, while practical, is somewhat arbitrary and may penalize some exploration strategies more than others.

3. Potential Impact

For AI evaluation: The benchmark fills a genuine gap between static reasoning benchmarks and fully open-ended scientific discovery tasks. It provides a controlled yet dynamic environment that captures the iterative nature of hypothesis testing. The connection to confirmation bias—a well-studied phenomenon in cognitive science—creates a bridge between AI evaluation and decades of psychological research.

For scientific AI agents: As LLMs are increasingly deployed in scientific workflows (the paper cites AI Scientist, AI co-scientist), understanding their falsification capabilities is practically important. The finding that even the best model (GPT-5.2-Chat at 75%) falls well short of optimal performance is a useful calibration for the field.

For cognitive science: The framework could serve as a tool for computational cognitive science, enabling large-scale comparisons between human and LLM reasoning strategies in hypothesis testing.

Limitations in scope: The benchmark tests a narrow (though important) slice of scientific reasoning. Real scientific discovery involves theory formation, experimental design with continuous variables, causal reasoning, and dealing with noisy observations—none of which are captured here. The semantic taxonomy domain, while better than numbers for contamination, is still quite constrained.

4. Timeliness & Relevance

The paper is highly timely. The rapid deployment of LLMs as autonomous scientific agents (Lu et al., 2024; Gottweis et al., 2025) makes it urgent to understand their reasoning limitations. The benchmark directly addresses whether these systems can engage in the kind of hypothesis-driven inquiry that underpins the scientific method. The focus on falsification is particularly apt given Popper's lasting influence on philosophy of science and the recent surge in agentic AI systems.

The work also arrives at a moment when the field is grappling with how to evaluate reasoning beyond pattern matching: the distinction between models that merely retrieve plausible hypotheses and those that actively test and revise them is increasingly important.

5. Strengths & Limitations

Key strengths:

Well-grounded in cognitive science theory (Wason, Klayman & Ha), providing normative baselines for evaluation

The turn-level analysis revealing failure modes (partial overlap hypotheses, surface-level linguistic features) is genuinely insightful and actionable

Strong empirical finding: the Spearman ρ = -0.937 between confirmation bias and conclusive falsification rate is striking

The Bayesian regression convincingly disentangles player strategy from oracle quality

Good model coverage (12 models across families and scales)

Code and data publicly available

Notable weaknesses:

100 games is a modest evaluation set; statistical power for per-category or per-model-pair comparisons is limited

The reliance on WordNet taxonomy means the "correct" categorizations are sometimes debatable (as the authors acknowledge)

No human baseline is reported, which would have been valuable for calibrating model performance against the cognitive science literature

The paper does not explore interventions (unlike Jhaveri et al., 2026, who test mitigation strategies for confirmation bias)

Single-run evaluation without confidence intervals on success rates limits reliability claims

The exclusive focus on H⊂R configurations, while well-motivated, means the benchmark does not test the full range of scientific reasoning scenarios
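To illustrate the point about missing confidence intervals, the sketch below computes a 95% Wilson score interval for a single success rate. It assumes that the reported 75% corresponds to 75 successes out of 100 games; the function name `wilson_interval` is a helper introduced here, not part of the paper's code.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Assumption: 75% success on the 100-game set means 75 solved games.
low, high = wilson_interval(75, 100)
print(f"95% Wilson interval: {low:.3f} to {high:.3f}")
```

Under that assumption the interval runs from roughly 66% to 82%, so differences of a few percentage points between models on a 100-game set should be interpreted with caution.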

Additional Observations

The qualitative game traces in the appendix (Tables 8-9) are illuminating—the failed game shows a model spiraling into orthographic hypotheses ("contains the letter 'o'", "word ends with a consonant"), which is a striking failure mode. The finding that stronger models' failures are almost exclusively characterized by surface-level linguistic hypotheses (92-97% for GPT-5.2-Chat and GLM-5) suggests a specific, potentially addressable weakness.

The paper's framing, which ties the task to scientific discovery, is reasonable but somewhat grandiose. The Wason task captures hypothesis testing but lacks many other elements of real scientific reasoning. The contribution would be equally strong with more modest framing.

Rating: 6.8/10

Significance 6.5 · Rigor 7 · Novelty 6.5 · Clarity 8

Generated Jun 5, 2026

Comparison History (16)

vs. The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

gemini-3.1 · 6/8/2026

Paper 1 introduces a concrete, empirical benchmark (FALSIFYBENCH) targeting a critical and highly relevant capability: scientific inductive reasoning and falsification in LLMs. It provides actionable insights and immediate utility for evaluating new models. In contrast, Paper 2 is primarily a position or agenda paper; while it offers a valuable conceptual bridge between classical control and foundation models, agenda papers typically have less immediate measurable impact than widely adopted empirical benchmarks.

vs. Multilingual Fine-Tuning via Localized Gradient Conflict Resolution

claude-opus-4.6 · 6/6/2026

Paper 1 addresses a fundamental and practically important problem in multilingual LLM fine-tuning with both theoretical contributions (proving Refined Pareto Stationarity) and empirical validation across multiple models. Its scalable distributed framework for gradient conflict resolution has broad applicability beyond multilingual settings to any multi-objective fine-tuning scenario. Paper 2, while interesting as a benchmark for evaluating inductive reasoning, is primarily diagnostic—it characterizes LLM limitations but doesn't provide solutions. Paper 1's methodological innovation with immediate practical utility for the growing multilingual AI community gives it higher potential impact.

vs. Universal Quantum Transformer

gpt-5.2 · 6/6/2026

Paper 1 is more likely to have near-term scientific impact: it introduces a concrete, reproducible benchmark for inductive/hypothesis-driven reasoning in LLM agents, provides comparative results across many existing models, and yields actionable insights (importance of falsification/negative testing, turn-level failure modes) relevant to AI evaluation and agent design. Paper 2 is highly novel but makes extraordinarily broad claims (“universally superior”) on limited demonstrations (5-qubit tasks) and depends on nascent quantum hardware; impact may be high if validated, but methodological and plausibility risks reduce expected impact versus a deployable, widely usable evaluation framework.

vs. Structure Enables Effective Self-Localization of Errors in LLMs

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact: it introduces a broadly useful benchmark for hypothesis-driven inductive reasoning with clear ties to scientific discovery workflows, enabling standardized evaluation and downstream method development across many labs. Its findings (importance of falsification/negative testing, turn-level failure modes) are general and actionable for model training, agent design, and interpretability. Paper 1 is innovative and application-relevant for self-correction, but is more narrowly scoped to a prompting/self-editing technique and depends on verification assumptions; its impact is likely more incremental compared to a new evaluation paradigm.

vs. LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact due to stronger real-world relevance and broader cross-field utility: it probes long-horizon planning over a large, naturally occurring knowledge graph (Wikipedia), connecting to agent navigation, tool use, web reasoning, and planning research. Its benchmark is easy to operationalize, has clear difficulty scaling, and includes a wide evaluation across frontier models with notable failure modes (replanning/loops). Paper 1 is novel and scientifically motivated (falsification-focused inductive reasoning), but its more synthetic setup may limit immediate adoption and downstream applications compared to Wiki-based planning.

vs. AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical and highly relevant gap in AI research: evaluating long-horizon, iterative autonomous agents for real-world R&D tasks. Its focus on closed-loop optimization across diverse domains like systems and model development offers broader, more immediate practical applications than Paper 1's narrower focus on a specific cognitive task (Wason 2-4-6). AutoLab's potential to accelerate the development of 'AI scientists' gives it a higher trajectory for widespread scientific and engineering impact.

vs. MAVEN-T: Reinforced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact due to strong real-world applicability (real-time autonomous driving), clear engineering relevance, and broad evaluation across multiple large-scale datasets with deployment-focused metrics (latency on Jetson). The combination of heterogeneous distillation with reinforcement learning for safety-aligned prediction is timely and could influence both research and industry practice. Paper 1 is novel and valuable for LLM evaluation and scientific reasoning diagnostics, but as a benchmark it may have narrower immediate application and impact than a method improving safety-critical, deployable prediction systems.

vs. Parthenon Law: A Self-Evolving Legal-Agent Framework

claude-opus-4.6 · 6/5/2026

FALSIFYBENCH addresses a fundamental question about LLM reasoning capabilities—inductive reasoning and hypothesis falsification—which is broadly relevant across AI, cognitive science, and philosophy of science. Its findings about negative testing as the key driver of success and the turn-level failure analysis provide generalizable insights for the entire field. Paper 2, while practically valuable, is more narrowly focused on a domain-specific agent framework for legal applications, with less generalizable scientific contributions. FALSIFYBENCH's benchmark methodology and cognitive science-grounded evaluation will likely influence a wider range of future research.

vs. FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact: it introduces a large, automatically generated and verifiable multimodal benchmark that tests global, topology-sensitive reasoning and full algebraic derivation—capabilities central to many scientific diagram domains beyond physics (circuits, chemistry, causal graphs). Its scale (2,000+), standardized pipeline, and stark performance gaps provide a strong, reusable diagnostic for model development. Paper 1 is novel in probing falsification-driven inductive reasoning, but its game-like setting may generalize less directly to deployed multimodal scientific workflows than diagram-to-structure-to-math reasoning.

vs. Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI

claude-opus-4.6 · 6/5/2026

FALSIFYBENCH addresses a fundamental question about LLM reasoning capabilities relevant to scientific discovery, introducing a novel benchmark grounded in established cognitive science (Wason task). It evaluates 12 models with fine-grained analysis revealing actionable insights about confirmation bias in LLMs. This has broad impact across AI safety, cognitive science, and scientific automation. Paper 2 addresses an important but narrower engineering problem in authorization for agentic AI. While practically useful, it is more incremental, extending existing IAM frameworks rather than revealing fundamental insights about AI capabilities.

vs. PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?

claude-opus-4.6 · 6/5/2026

FALSIFYBENCH addresses a more fundamental scientific question—whether LLMs can perform inductive reasoning central to scientific discovery—with broader implications for AI-driven science. Its connection to Popper's falsificationism, systematic evaluation of 12 models with fine-grained turn-level analysis, and identification of negative testing as the key driver of success provide deeper methodological and theoretical contributions. While PersistBench identifies important safety risks in long-term memory systems, it addresses a narrower, more applied concern. FALSIFYBENCH's findings about reasoning capabilities have wider cross-disciplinary relevance for AI in scientific research.

vs. Success Conditioning as Policy Improvement: The Optimization Problem Solved by Imitating Success

gpt-5.2 · 6/5/2026

Paper 2 has higher potential impact due to a strong unifying theoretical result: it formalizes success conditioning (spanning SFT rejection sampling, goal-conditioned RL, Decision Transformers) as solving a specific trust-region optimization with an automatically set χ² radius, yielding interpretable identities and safety-relevant guarantees (non-degradation, observable failure modes). This is methodologically rigorous and broadly applicable across modern RL/LLM training pipelines, making it timely and likely to influence both theory and practice. Paper 1 is a useful, novel benchmark for scientific-style induction, but its impact is narrower (evaluation-centric) and less likely to reshape core optimization/training paradigms.

vs. Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making

claude-opus-4.6 · 6/5/2026

Paper 2 (MechSim) addresses a more impactful problem—integrating scientific simulators with LLM reasoning for high-stakes decision-making—with a novel neuro-symbolic framework that has direct real-world applications across multiple domains. While Paper 1 (FalsifyBench) makes a solid contribution by benchmarking inductive reasoning in LLMs with insights about negative testing, it is primarily a diagnostic evaluation benchmark. Paper 2 introduces a new framework (MechSim) with broader applicability to scientific simulation, transparency, and auditability, addressing critical needs in AI-assisted decision-making that span multiple high-stakes fields.

vs. Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints

gpt-5.2 · 6/5/2026

Paper 2 is likely to have higher scientific impact due to its broader, more general contribution: a new benchmark/framework for evaluating inductive, hypothesis-driven reasoning in LLM agents, which is timely given rapid deployment of LLMs in science and automation. Its findings (importance of falsification/negative testing, turn-level failure modes) are broadly applicable across ML, cognitive science, and AI safety/evaluation, enabling follow-on work in training and agent design. Paper 1 is strong and application-relevant for clinical EHRs, but its impact is narrower and more domain-specific.

vs. Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions

claude-opus-4.6 · 6/5/2026

FALSIFYBENCH addresses the timely and high-impact question of whether LLMs can perform scientific inductive reasoning, directly relevant to the rapidly growing field of AI agents for science. Its benchmark methodology is practical, reproducible, and applicable across model families. The finding that negative testing (falsification) is the primary driver of success provides actionable insights for improving LLM reasoning. Paper 2, while mathematically rigorous and theoretically interesting, addresses a narrower audience with its formal complementarity framework for HAI. Its impossibility results for classification are notable but may limit practical uptake. Paper 1's broader relevance to LLM evaluation and scientific discovery gives it higher potential impact.

vs. Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

gemini-3.1 · 6/5/2026

Paper 1 explores a fundamental question regarding the capacity of LLMs for inductive scientific reasoning and hypothesis falsification. By bridging cognitive psychology (the Wason task) with LLM evaluation, it provides critical insights into the limitations of current AI agents in scientific discovery. While Paper 2 offers a strong technical algorithmic improvement for multimodal reinforcement learning, Paper 1 has broader interdisciplinary implications for AI, cognitive science, and the deployment of autonomous systems in real-world scientific research, giving it higher potential impact.