When Do Data-Driven Systems Exhibit the Capability to Infer?

Maximilian Poretschkin, Tabea Naeven

Jun 10, 2026 · arXiv:2606.11769v1

cs.AI · cs.LG

#3207 of 3489 · Artificial Intelligence

Tournament Score: 1250±48

Win Rate: 29%

Rating: 6.2/10

Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Abstract

The European AI Act is the first comprehensive regulation of artificial intelligence (AI), setting out extensive obligations, particularly for so-called high-risk and general-purpose AI systems. A key distinguishing feature of AI systems under the AI Act is the capability to infer. Since the AI Act does not clearly define what inference is, there is a gray area for certain data-driven systems. A specific example is credit scoring systems, which are listed by Annex III of the AI Act. At the same time, however, these are often implemented using statistical models for which it is unclear whether they have the capability to infer and thus fall under the AI definition of the AI Act at all. Motivated by statistical learning theory, this work develops a framework for grading different levels of the capability to infer. Based on the AI Act and the Commission Guidelines on the definition of an artificial intelligence system, we analyze which levels constitute sufficient capability to infer within the meaning of the AI Act and where further regulatory clarity is needed. We illustrate the framework by creating two realistic credit scoring workflows and show whether and where inference occurs in them. Our analysis illustrates that not only individual models but the entire data processing workflow must be considered. It also shows that the involvement of human experts during development can have significant influence on the capability to infer. Code can be found at https://github.com/fraunhofer-iais/inference-framework-creditscorecards.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper tackles a genuinely important regulatory gap: the EU AI Act designates "capability to infer" as the key distinguishing feature of AI systems, yet provides no precise operationalization of this concept. The authors propose a five-level staged framework (Levels 0–4) that grades inference capability according to how much data shapes the input-output mapping—from fixed mappings (Level 0) through parametric adaptation (Level 1), structural selection (Level 2), explicit and implicit structural construction (Level 3a/3b), to representational construction (Level 4). The framework is grounded in statistical learning theory, particularly Breiman's two-cultures distinction, and is illustrated through two realistic credit scoring workflows.

The paper's most valuable insight is that the inference capability of a system must be assessed at the workflow level, not at the individual model level. A logistic regression model in isolation is Level 1, but when preceded by data-driven decision-tree binning, the overall system operates at Level 3a. This workflow-level perspective is practically important and non-obvious.
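The workflow-level reading can be sketched in a few lines of Python. This is only an illustration of the idea as summarized in this assessment, not the authors' code: the `LEVEL_RANK` table and the `workflow_level` helper are hypothetical names, and ranking Level 3b above Level 3a is an assumption made solely so the levels can be compared.

```python
# Rank of each level named in the review. Placing 3b above 3a is an
# assumption made only so that the levels can be ordered.
LEVEL_RANK = {"0": 0, "1": 1, "2": 2, "3a": 3, "3b": 3.5, "4": 4}


def workflow_level(steps):
    """Return the highest level label among (step_name, level_label) pairs."""
    return max((label for _, label in steps), key=LEVEL_RANK.__getitem__)


# Hypothetical workflow mirroring the example in the text above.
model_only = [("logistic regression", "1")]
full_workflow = [
    ("decision-tree binning", "3a"),
    ("logistic regression", "1"),
]

print(workflow_level(model_only))     # 1
print(workflow_level(full_workflow))  # 3a
```

Assessing only the final model would report Level 1, while the same model embedded in the full workflow is reported at Level 3a, which is the point the paper makes.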

2. Methodological Rigor

The framework is conceptually well-constructed, with each level clearly defined and illustrated with canonical examples. The progression from Level 0 to Level 4 is logically coherent: each level strictly extends the previous one in terms of data's role in shaping the mapping.

However, the framework is primarily taxonomic rather than formally rigorous. The authors acknowledge this limitation—they note that VC dimension could provide a formal measure of hypothesis space complexity but exclude it "for reasons of practical applicability." While this is understandable, it means the framework relies on qualitative judgments about where a system falls. The boundary between levels can be fuzzy (e.g., the footnote about small vs. large numbers of candidate splits potentially shifting between Level 2 and Level 3 reveals this ambiguity).

The credit scoring case study is well-executed: two realistic workflows are built on the same dataset (Kaggle's Give Me Some Credit), using identical preprocessing but different binning and feature selection approaches. The comparison is informative and clearly demonstrates how workflow design choices affect the inference classification. The code is publicly available, supporting reproducibility. However, the dataset is relatively small (150K instances, 10 features) compared to industry practice, as the authors acknowledge.
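To make the difference between binning approaches concrete, the sketch below contrasts bin edges fixed in advance with a bin edge derived from the data by a depth-1 decision tree (a single Gini-based split). It is a toy illustration under stated assumptions: the data are made up, the function names are hypothetical, and nothing here is taken from the authors' repository or from the Give Me Some Credit dataset.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return the threshold minimizing weighted Gini impurity (depth-1 tree)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_score = score
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
    return best_threshold


def assign_bin(value, edges):
    """Index of the bin that value falls into, given sorted bin edges."""
    return sum(value > edge for edge in edges)


# Made-up toy data: a utilization-like feature and a default flag.
utilization = [0.05, 0.10, 0.20, 0.35, 0.60, 0.75, 0.90, 0.95]
defaulted = [0, 0, 0, 0, 1, 1, 1, 1]

expert_edges = [0.30, 0.70]                           # fixed in advance, ignores the data
learned_edges = [best_split(utilization, defaulted)]  # derived from the data

expert_bins = [assign_bin(v, expert_edges) for v in utilization]
learned_bins = [assign_bin(v, learned_edges) for v in utilization]
print(expert_bins)
print(learned_bins)
```

In the first case the structure of the mapping is set by a person; in the second it is constructed from the training data, which is the kind of design choice that changes how a workflow is classified.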

The legal analysis is careful but necessarily inconclusive. The authors correctly identify the confusion in paragraph 42 of the Commission Guidelines regarding regression models and "basic data processing," and they present both possible interpretations of where the threshold lies. This intellectual honesty is appropriate given the genuine legal ambiguity.

3. Potential Impact

The paper sits at the intersection of computer science and AI regulation—a space of enormous practical importance. With the AI Act now in force and industry struggling to determine compliance obligations, a principled framework for assessing whether specific systems qualify as "AI" under the Act has direct regulatory value.

For industry practitioners: The framework provides a structured methodology for self-assessment. Credit scoring is a multi-billion-euro industry, and the classification of scoring systems as AI (or not) has significant compliance cost implications.

For regulators and standardization bodies: The framework could inform guidance documents, enforcement decisions, and standards development (e.g., within CEN/CENELEC working groups).

For legal scholars: The paper advances the interdisciplinary dialogue between computer science and law on AI governance.

The impact is somewhat limited by the fact that the framework cannot itself determine the regulatory threshold—this ultimately requires political/legal decisions. The paper also does not address symbolic AI systems, limiting its generality within the AI Act's scope.

4. Timeliness & Relevance

The timing is excellent. The AI Act entered into force in 2024, with compliance deadlines phased over 2025–2027. Industry is actively grappling with scope questions. The Commission Guidelines were published in early 2025, and this paper provides a timely critical analysis of their ambiguities. The credit scoring use case is particularly well-chosen given its explicit mention in Annex III and the active debate about whether logistic regression constitutes AI.

5. Strengths & Limitations

Key Strengths:

Addresses a real, consequential regulatory ambiguity with practical implications

The staged framework is intuitive, well-motivated by learning theory, and practically applicable

The workflow-level analysis is an important conceptual contribution—showing that preprocessing steps like binning can elevate a system's inference level

The demonstration that human intervention can destroy or preserve inference capability is nuanced and practically relevant

Code availability supports reproducibility

Careful and honest treatment of legal ambiguities, avoiding overclaiming

Notable Limitations:

The framework is taxonomic rather than formally grounded; boundary cases remain judgment calls

Only supervised learning workflows are analyzed in depth; unsupervised and reinforcement learning are mentioned but not explored

Symbolic/knowledge-based AI systems are not addressed

The legal analysis, while careful, is necessarily preliminary and non-authoritative

The credit scoring examples, while illustrative, use a relatively simple dataset

The paper does not discuss how the framework handles hybrid systems that combine components at different levels in non-linear ways (beyond the simple "maximum level" rule)

Limited engagement with how other jurisdictions (US Executive Order, etc.) handle similar definitional questions

Additional Observations

The "maximum level rule" (the component with the highest inference level determines the system's overall level) is stated but not justified. One could argue that a Level 3 binning step whose outputs are entirely overridden by human judgment should not elevate the system's classification—the authors partially address this but a more systematic treatment would strengthen the framework.
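The objection can be made concrete with a small sketch that compares the maximum level rule with a variant that ignores fully overridden steps. The variant is hypothetical and is not proposed in the paper; the `Step` class, the numeric encoding of levels, and the fallback to Level 0 when every step is overridden are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    level: float              # numeric rank of the inference level (assumed encoding)
    overridden: bool = False  # True if a human fully replaces this step's output


def max_level(steps):
    """The 'maximum level' rule as described in the review."""
    return max(step.level for step in steps)


def effective_level(steps):
    """Hypothetical variant: ignore steps whose outputs are entirely overridden.

    Falls back to 0 if every step is overridden (an assumption).
    """
    active = [step.level for step in steps if not step.overridden]
    return max(active, default=0)


# Hypothetical workflow: data-driven binning whose bins an expert replaces by hand.
steps = [
    Step("tree-based binning", level=3, overridden=True),
    Step("logistic regression", level=1),
]

print(max_level(steps))        # 3
print(effective_level(steps))  # 1
```

Whether a partially overridden step should count, and to what degree, is exactly the question a more systematic treatment would need to settle.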

The paper would benefit from engaging more deeply with the standardization landscape (ISO/IEC, CEN/CENELEC) where such frameworks might find formal adoption.

Rating: 6.2/10

Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Generated Jun 11, 2026

Comparison History (17)

Won vs. Towards Responsibly Non-Compliant Machines

Paper 2 has higher likely impact due to a more concrete, timely contribution: a formal framework (grounded in statistical learning theory) to operationalize “capability to infer” in the EU AI Act, illustrated with realistic credit-scoring workflows and accompanied by code. This offers actionable guidance for regulators and practitioners, with immediate real-world application in compliance and risk assessment across many deployed data-driven systems. Paper 1 raises important conceptual issues around responsible non-compliance, but appears more agenda-setting and less methodologically specified, making near-term uptake and measurable impact less certain.

gpt-5.2 · Jun 11, 2026

Won vs. PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents

Paper 2 addresses the EU AI Act's definition of 'capability to infer,' a foundational regulatory question affecting all AI systems in Europe. It provides a novel theoretical framework grounded in statistical learning theory with broad implications for AI governance, compliance, and policy interpretation across industries. Its interdisciplinary contribution (law + machine learning) is timely and relevant to a massive regulatory landscape. Paper 1, while practically useful, is a narrowly scoped engineering tool evaluated only through a single-person self-study, limiting its scientific rigor and broader impact.

claude-opus-4-6 · Jun 11, 2026

Won vs. A Reliable Fault Diagnosis Method Based on Belief Rule Base Consider Robustness Analysis

Paper 2 addresses a timely and broadly impactful topic at the intersection of AI regulation and technical practice. The EU AI Act affects countless organizations globally, and providing a rigorous framework for determining when data-driven systems qualify as AI under the Act has significant practical, legal, and policy implications. Its interdisciplinary nature (spanning machine learning theory, law, and finance) gives it broader reach. Paper 1, while solid, addresses a more incremental improvement to fault diagnosis methodology within a narrower technical domain with less cross-disciplinary impact.

claude-opus-4-6 · Jun 11, 2026

Lost vs. MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

Paper 1 presents a highly innovative, technically rigorous framework for social intelligence reasoning, integrating multi-agent systems, multimodal LLMs, knowledge distillation, and Test-Time Adaptation. Its achievement of state-of-the-art results and provision of open-source models, code, and datasets give it strong potential for high scientific impact and immediate utility in the AI research community. While Paper 2 is timely for AI policy, Paper 1 offers broader, foundational advancements in core AI methodology.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

Paper 1 likely has higher scientific impact: it proposes a novel, technically concrete agent architecture (claim-level market mechanism + program synthesis + verifier) and demonstrates strong empirical results across many established benchmarks, enabling immediate adoption for high-stakes numerical/financial QA. Its methodological contribution is directly testable and broadly extensible to other grounded reasoning domains (science, auditing, data analysis). Paper 2 is timely and valuable for AI governance, but it is primarily a conceptual/legal-technical framework with narrower, policy-centric scientific traction and less generalizable empirical validation.

gpt-5.2 · Jun 11, 2026

Lost vs. Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

Paper 1 has higher likely scientific impact due to a concrete, technically novel ML contribution (domain knowledge graph + KG-constrained reasoning chains + KG-guided RL) with direct clinical decision-support potential and reusable resources (LungKG). It reports empirical SOTA gains across multiple benchmarks, indicating methodological rigor and immediate relevance to healthcare AI. Paper 2 is timely and important for AI governance, but its main output is a conceptual/regulatory framework with narrower scientific/technical novelty and less clearly generalizable empirical validation, yielding more limited cross-field methodological impact.

gpt-5.2 · Jun 11, 2026

Lost vs. Can AI Agents Synthesize Scientific Conclusions?

Paper 1 addresses a critical and highly relevant challenge across all sciences: the reliability of AI agents in synthesizing scientific literature. By introducing a rigorous benchmark and clean-room evaluation to prevent data leakage, it exposes significant flaws in frontier models. Its impact spans AI development, healthcare, and general scientific methodology. Paper 2, while important for policy, is narrowly focused on regulatory compliance and interpreting the EU AI Act, limiting its broader scientific innovation compared to Paper 1.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

Paper 2 has higher estimated scientific impact due to a novel, generalizable method for long-horizon LLM agents (hierarchical memory plus RL-based navigation) addressing a widely recognized bottleneck (context limits, cost, latency). It demonstrates methodological rigor with multi-benchmark evaluation and clear efficiency/performance gains, and has broad applicability across agentic systems, RAG, and robotics/task automation—highly timely given rapid adoption of LLM agents. Paper 1 is important for AI governance clarity but is more domain- and jurisdiction-specific, with narrower technical spillover beyond regulatory/credit-scoring contexts.

gpt-5.2 · Jun 11, 2026

Lost vs. AutoMine Solution for AV2 2026 Scenario Mining Challenge

Paper 2 likely has higher scientific impact: it introduces a novel, timely LLM/VLM-driven self-refining scenario mining pipeline with execution-feedback code refinement and demonstrates competitive performance on a prominent CVPR 2026 benchmark/competition, with clear real-world relevance to autonomous driving safety evaluation and potential reuse across robotics/ML. Paper 1 is valuable for AI regulation clarity, but its contribution is more domain-specific (legal/definition of inference under the EU AI Act) and less likely to drive broad technical adoption or cross-field methodological advances at scale.

gpt-5.2 · Jun 11, 2026

Lost vs. INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

Paper 1 (INFRAMIND) addresses a critical and timely gap in multi-agent LLM orchestration by incorporating infrastructure awareness, demonstrating strong empirical results (7.6pp accuracy gain, 7x lower latency, 99.9% SLO compliance). It introduces a novel hierarchical constrained MDP framework with broad applicability to the rapidly growing LLM deployment ecosystem. Paper 2, while valuable for AI regulation discourse, is more narrowly scoped to EU AI Act interpretation and credit scoring, with impact limited primarily to the legal/policy domain rather than driving broad scientific or technical advances.

claude-opus-4-6 · Jun 11, 2026
