Data Language Models: A New Foundation Model Class for Tabular Data

Eda Erol, Giuliano Pezzoli, Ozer Cem Kelahmet

May 7, 2026

arXiv:2605.06290v1

cs.AI (primary)

#94 of 2292 · Artificial Intelligence

Tournament Score: 1545 ± 46 (scale 1050 to 1800)

Win Rate: 95%

Rating: 4/10

Significance 5.5 · Rigor 2.5 · Novelty 5 · Clarity 5.5

Abstract

Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. Every approach to tabular AI today, from gradient-boosted trees to the latest tabular foundation models, requires a preprocessing pipeline before any model can consume the data. None of them understand tabular data as a modality. We introduce the Data Language Model (DLM), the missing foundation model for tabular data. A DLM understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values. It is the tabular data layer on which AI models, agents, and vertical AI applications can be built, eliminating the preprocessing pipelines that currently stand between raw data and every AI system that consumes it. We present Schema-1, the first DLM: a 140M parameter model trained on more than 2.3M synthetic and real-world tabular datasets. Schema-1 outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models we evaluate on established row-level prediction benchmarks. On missing value reconstruction it achieves lower reconstruction error than all classical statistical methods and frontier large language models on mean performance across conditions, establishing that structural understanding of a dataset's own distributional geometry is more useful for imputation than world knowledge encoded in language. It identifies the industry sector of any unseen dataset from raw cell values alone, reliably across any domain, a task no prior tabular model can perform. It is the native tabular understanding layer that has been missing from the AI stack.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Data Language Models: A New Foundation Model Class for Tabular Data

1. Core Contribution

The paper introduces "Data Language Models" (DLMs) as a formally defined foundation model class for tabular data, characterized by three conditions: multi-signal native ingestion, dataset-level contextual inference (domain identification), and metadata-independent operation. The concrete instantiation, Schema-1, is a 140M parameter model trained on 2.3M datasets. The most genuinely novel contribution is blind dataset sector classification—identifying the industry domain of an unlabeled, metadata-free dataset from distributional structure alone across 10,000 sectors. This task is new and conceptually interesting. The paper also demonstrates strong results on row-level prediction (CC18), missing data robustness, imputation, column-agnostic prediction, and sequential fine-tuning retention.

2. Methodological Rigor

This is where the paper has critical deficiencies. The model architecture is essentially undisclosed. We learn Schema-1 has "four input pathways," 140M parameters, and a "retention component" and "adaptive memory component," but there is no specification of the neural network architecture, loss functions, optimizer, training procedure, or any technical detail sufficient for reproduction. This is a fundamental gap for a scientific paper claiming to introduce a new model class.

Benchmark fairness concerns are substantial:

On CC18, Schema-1 uses fine-tuning mode (backbone frozen, task-specific adaptation on labeled data), while TabPFN operates in zero-shot/in-context learning mode. This comparison conflates model capacity with paradigm differences. The reported jump from 0.9339 to 0.9849 mean ROC-AUC is extraordinary and would benefit from careful analysis of whether fine-tuning on the training folds explains most of the gain.

The imputation benchmark sources competitor numbers from Mangussi et al. [2026]—a reference dated in 2026, raising questions about the provenance and stability of these comparison numbers.

The sequential fine-tuning comparison defines GBDTs as having "0% retention" by construction, making the 97.8% margin trivially inflated.

No error bars, confidence intervals, or statistical significance tests are reported on any benchmark.

The sector classification evaluation, while novel, is entirely self-referential: the taxonomy of 10,000 sectors was designed by the authors, the synthetic training data was generated against this taxonomy, and external validation is impossible without access to the taxonomy or model.
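The missing error bars noted above are cheap to produce once per-dataset scores are available. As a minimal sketch, assuming each model has one ROC-AUC value per benchmark dataset, a paired percentile bootstrap gives a confidence interval for the mean difference. The function name `paired_bootstrap_ci` and the score lists are hypothetical and made up for illustration; they are not taken from the paper, and this is not the authors' evaluation protocol.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-dataset difference (a - b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per dataset")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample datasets with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi


# Made-up per-dataset ROC-AUC values, for illustration only (not from the paper).
model_a = [0.98, 0.97, 0.99, 0.95, 0.96, 0.99]
model_b = [0.93, 0.95, 0.94, 0.96, 0.90, 0.92]

mean_diff, lo, hi = paired_bootstrap_ci(model_a, model_b)
print(f"mean diff = {mean_diff:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

Resampling whole datasets rather than individual rows keeps the comparison paired, which matters when the mean is taken over a benchmark suite such as CC18. An interval that excludes zero would support a claimed gain; it would not resolve the fine-tuning versus zero-shot confound raised above.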

3. Potential Impact

If the claims are reproducible, the practical impact could be significant for enterprise AI workflows. The vision of eliminating preprocessing pipelines for tabular data addresses a real engineering pain point. Domain identification from raw distributional signatures could enable automated data cataloging, privacy classification, and regulatory compliance. The framing of tabular data as a modality deserving native foundation models is reasonable and aligns with calls from van Breugel and van der Schaar [2024].

However, the complete lack of reproducibility severely limits actual scientific impact. No code, no model weights, no architecture details, and no taxonomy specification are provided. Other researchers cannot build on, verify, or extend this work.

4. Timeliness & Relevance

The paper addresses a timely topic. Tabular foundation models are an active research area, and the limitations of existing approaches (cold-start for GBDTs, serialization loss for LLM-based methods, lack of domain understanding in TabPFN-style models) are well-documented. The positioning toward vertical AI and agentic systems reflects genuine industry trends. The formal definition of what distinguishes a DLM from prior approaches is a useful conceptual contribution, even if the three conditions appear tailored to exclude all competitors.

5. Strengths & Limitations

Strengths:

Novel task formulation (blind sector classification) that opens a new evaluation dimension

Comprehensive benchmarking across six evaluation axes covering prediction, robustness, imputation, and continual learning

The column-agnostic evaluation is well-designed and provides genuine insight about semantic vs. structural information

The conceptual argument that structural co-distributional learning outperforms world knowledge for imputation is interesting and supported by the column-agnostic ablation

Limitations:

No architecture disclosure: This alone makes the paper unverifiable as science

Promotional tone: The writing reads as a product announcement rather than a research paper. Phrases like "the missing foundation model" and repeated historical parallels to LLMs/vision models undermine scientific credibility

Comparison fairness: Fine-tuned model vs. zero-shot models; self-defined benchmarks; no recent competitors (TabPFN-2.5 excluded from CC18 because "they haven't published under the same protocol," yet the paper claims superiority over all tabular methods)

No statistical rigor: Zero error bars across all experiments

Synthetic training data: 87% of training data is synthetic, yet the generation procedure is undescribed. The risk of distribution leakage between synthetic training and real-world test sets is unaddressed

Self-serving definition: The three DLM conditions appear reverse-engineered from Schema-1's capabilities to guarantee it is the only system satisfying all three
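The leakage concern listed under "Synthetic training data" could be probed, at its crudest, by checking whether evaluation rows also appear in the training corpus. The sketch below is an assumption-laden illustration: the helper names `row_fingerprint` and `overlap_fraction` and the toy rows are hypothetical, and exact-duplicate matching is a much weaker check than the distributional leakage the assessment describes.

```python
import hashlib


def row_fingerprint(row):
    """Hash a row's raw cell values, ignoring column names (hypothetical helper)."""
    joined = "\x1f".join(str(cell).strip().lower() for cell in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def overlap_fraction(train_rows, test_rows):
    """Fraction of test rows whose fingerprint also appears in the training rows."""
    if not test_rows:
        return 0.0
    train_hashes = {row_fingerprint(r) for r in train_rows}
    hits = sum(row_fingerprint(r) in train_hashes for r in test_rows)
    return hits / len(test_rows)


# Toy rows, for illustration only.
train = [(5.1, "A", 1), (4.9, "B", 0), (6.2, "a", 1)]
test = [(5.1, "a", 1), (7.0, "C", 0)]

print(overlap_fraction(train, test))
```

A nonzero overlap would flag contamination, but a zero overlap proves little: synthetic data fitted to the same distributions as a benchmark can leak information without sharing a single row.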

Additional Observations

The paper conflates multiple distinct capabilities (domain identification, prediction, imputation, continual learning) under a single model class definition, making it difficult to assess which capabilities are genuinely novel versus incremental improvements achieved through scale or fine-tuning. The 2.3M training dataset corpus is substantial, but without understanding what distributional patterns the synthetic data covers, it's impossible to assess whether the model generalizes or memorizes distributional templates. The absence of any open-source component makes this work primarily a commercial announcement wrapped in academic framing.

Rating: 4/10

Significance 5.5 · Rigor 2.5 · Novelty 5 · Clarity 5.5

Generated May 8, 2026

Comparison History (19)

vs. Imperfect World Models are Exploitable

gemini-3.1 · 5/19/2026

Paper 1 introduces a foundation model for tabular data, an incredibly ubiquitous modality in real-world applications. By eliminating preprocessing pipelines and outperforming existing methods like gradient-boosted trees, it offers immense practical utility and broad impact across multiple industries. Paper 2, while offering rigorous theoretical contributions to AI safety and reinforcement learning, has a narrower scope and less immediate real-world applicability compared to revolutionizing tabular data analysis.

vs. Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents

claude-opus-4.6 · 5/19/2026

Paper 2 introduces a fundamentally new foundation model class (Data Language Models) for tabular data, which is arguably the most widely used data modality in industry yet lacks native foundation models. This addresses a significant gap in the AI stack with broad applications across virtually every domain that uses structured data. While Paper 1 makes a solid contribution by identifying longitudinal safety risks in memory-equipped LLM agents—an important and timely concern—its scope is narrower, focused on a specific failure mode. Paper 2's potential to reshape how tabular data is consumed across the entire AI ecosystem gives it broader impact potential.

vs. The World Leaks the Future: Harness Evolution for Future Prediction Agents

gpt-5.2 · 5/16/2026

Paper 1 proposes a new foundation-model class for tabular data that directly consumes raw tables without serialization/preprocessing, a potentially paradigm-shifting capability for a dominant real-world modality (finance, healthcare, ops). If validated, it could broadly replace/augment standard pipelines and impact many fields. It reports strong benchmark results and novel abilities (dataset sector identification, robust imputation). Paper 2 is timely and useful for agentic forecasting, but is more incremental—an improved harness/training protocol atop existing LLMs—with narrower applicability and potentially higher sensitivity to evaluation design.

vs. Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

gemini-3.1 · 5/16/2026

Paper 2 introduces a fundamentally new foundation model class for tabular data, the most ubiquitous data modality in enterprise and scientific research. While Paper 1 addresses a highly timely issue (evaluating production AI agents), Paper 2's Data Language Model natively processes raw tables without preprocessing, claiming to outperform established methods like XGBoost and AutoML. This represents a potential paradigm shift in how structured data is handled across all of AI, offering broader, cross-disciplinary scientific impact and immense real-world utility compared to a domain-specific evaluation framework.

vs. SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

gpt-5.2 · 5/16/2026

Paper 1 introduces a potentially new foundation-model class for tabular data with strong claimed performance across prediction, imputation, and dataset-level understanding directly from raw cells, addressing a ubiquitous, high-stakes modality in industry. If validated, eliminating preprocessing and providing native tabular representations could broadly impact ML systems, AutoML, data engineering, and vertical applications. Paper 2 is timely and useful for agent memory, but graph/RAG memory engines are a crowded space and likely more incremental; impact is narrower to LLM-agent workflows. Overall, Paper 1 appears more novel with wider cross-domain applicability.

vs. Internalizing Safety Understanding in Large Reasoning Models via Verification

gpt-5.2 · 5/16/2026

Paper 2 targets a timely, high-stakes problem—LLM safety/alignment—where advances rapidly diffuse across academia and industry. Its central idea (training models to internally verify the safety of their own outputs) is a novel shift from behavioral compliance to intrinsic self-critique, with clear real-world applications (jailbreak robustness, safer deployment) and broad cross-domain relevance. The verification-based framework plus RL initialization claim suggests methodological rigor and extensibility. Paper 1 is innovative for tabular foundation models, but its impact may be narrower and hinges on strong evidence that “no preprocessing” generalizes across messy real-world schemas.

vs. ASMR-Bench: Auditing for Sabotage in ML Research

claude-opus-4.6 · 5/16/2026

Paper 1 introduces a fundamentally new foundation model class for tabular data—one of the most widely used data modalities in real-world AI—addressing a long-standing gap in the AI stack. Its broad applicability across prediction, imputation, and dataset understanding, combined with strong empirical results surpassing gradient-boosted ensembles, AutoML, and frontier LLMs, gives it enormous potential impact across many fields. Paper 2 addresses an important but narrower AI safety concern (sabotage detection in ML research), with a benchmark of limited scale (9 codebases) and results showing current methods are inadequate, making it valuable but more niche in scope.

vs. Controllable User Simulation

gemini-3.1 · 5/16/2026

Paper 1 introduces a novel foundation model natively designed for tabular data, the most ubiquitous data modality in enterprise and scientific applications. By eliminating preprocessing and outperforming existing SOTA (like gradient-boosted trees and AutoML), it has massive potential for broad, cross-disciplinary real-world impact. While Paper 2 presents rigorous theoretical work on user simulation for conversational agents, its scope is much narrower compared to the universal relevance of a breakthrough in tabular data modeling.

vs. Grounding Clinical AI Competency in Human Cognition Through the Clinical World Model and Skill-Mix Framework

gemini-3.1 · 5/16/2026

Paper 1 introduces a novel foundation model architecture natively designed for tabular data, the most ubiquitous data modality in enterprise and scientific applications. By eliminating preprocessing pipelines and demonstrating state-of-the-art performance across diverse tasks, it offers broad, cross-disciplinary utility. In contrast, Paper 2 presents a domain-specific conceptual framework for clinical AI. While valuable for medical AI regulation and evaluation, Paper 1's concrete technological advancement and vast potential for real-world application across nearly all data-driven fields give it significantly higher potential scientific impact.

vs. First-Order Efficiency for Probabilistic Value Estimation via A Statistical Viewpoint

claude-opus-4.6 · 5/8/2026

Paper 1 introduces a fundamentally new foundation model class for tabular data—the most common data modality in enterprise AI—addressing a major gap in the foundation model ecosystem. Its breadth of impact is enormous: it eliminates preprocessing pipelines, outperforms established methods (GBDTs, AutoML, tabular FMs, LLMs) across multiple tasks, and enables novel capabilities like dataset-level understanding. Paper 2 makes a solid but narrower theoretical contribution to statistical efficiency of Shapley value estimation. While rigorous, its scope is limited to a specific estimation problem, whereas Paper 1 could reshape how all tabular AI systems are built.

vs. First-Order Efficiency for Probabilistic Value Estimation via A Statistical Viewpoint

gpt-5.2 · 5/8/2026

Paper 1 has higher potential impact due to its broad, modality-level framing (a native foundation model for tabular data) and strong real-world applicability across virtually all data-driven domains where tables dominate. If validated, eliminating preprocessing/serialization would be a substantial shift in tabular ML practice and could enable new agentic/vertical AI stacks. Its claims span multiple tasks (prediction, imputation, dataset characterization), suggesting wide downstream leverage. Paper 2 is methodologically rigorous and timely for XAI/data valuation, but it advances a narrower estimation subproblem with more limited cross-field disruption.

vs. SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

claude-opus-4.6 · 5/8/2026

Paper 1 introduces a fundamentally new foundation model class for tabular data—a major underserved modality in AI. Its contributions are broad: native tabular understanding without preprocessing, strong benchmarks against GBMs/AutoML/LLMs, and novel capabilities like dataset-level industry identification. The 140M parameter model trained on 2.3M datasets demonstrates significant methodological rigor and addresses a foundational gap in the AI stack. Paper 2 presents an incremental framework for evolving LLM agent skills, which is useful but narrower in scope, building on existing agent paradigms rather than establishing a new model class.

vs. SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

gemini-3.1 · 5/8/2026

Paper 1 introduces a native foundation model for tabular data, a ubiquitous and critical modality across nearly all scientific and enterprise domains. By eliminating preprocessing pipelines and outperforming established methods like GBDTs, it offers a fundamental breakthrough in representation learning. While Paper 2 presents a valuable framework for agentic skill evolution, Paper 1 addresses a more foundational and universally applicable problem, promising significantly broader real-world applications and transformative scientific impact across multiple disciplines.

vs. Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition

gpt-5.2 · 5/8/2026

Paper 1 likely has higher scientific impact: it proposes a new model class (Data Language Models) that natively consumes raw tabular data, claims broad performance gains over strong baselines, and targets a core modality used across high-stakes domains, enabling wide downstream applications (prediction, imputation, dataset understanding). If validated, it could shift tabular ML foundations and infrastructure. Paper 2 is timely and useful as a benchmark for LLM economic behavior, but benchmarks typically have narrower, more incremental impact than a potentially paradigm-shifting modeling approach.

vs. Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition

gemini-3.1 · 5/8/2026

Paper 1 introduces a novel foundation model for tabular data, arguably the most ubiquitous data modality in enterprise applications. By eliminating preprocessing and outperforming traditional gradient-boosted trees, it represents a potential paradigm shift in data science. Paper 2, while offering a valuable economic benchmark for LLMs, has a narrower scope and focuses on evaluation rather than introducing a fundamental, broadly applicable AI architecture.

vs. DataDignity: Training Data Attribution for Large Language Models

gpt-5.2 · 5/8/2026

Paper 2 likely has higher impact: proposing a native “foundation model” for tabular data is a broad, modality-level shift with wide applicability across industry (finance, healthcare, operations) and ML systems, potentially reducing reliance on brittle preprocessing and enabling new agents/workflows. If validated, it could influence benchmarks, tooling, and downstream research across databases, AutoML, and representation learning. Paper 1 is timely and rigorous for LLM auditing/provenance, but its impact is more narrowly scoped to attribution/retrieval and evaluation, whereas DLMs could reshape a larger swath of ML practice.

vs. Saliency-Aware Regularized Quantization Calibration for Large Language Models

gpt-5.2 · 5/8/2026

Paper 1 is more novel and potentially foundational: it proposes a new model class (native tabular “Data Language Models”) and demonstrates broad capabilities (prediction, imputation, dataset identification) that could reshape how tabular ML systems are built, with high real-world applicability across many industries relying on tables. Its impact could span ML, data management, and AI product stacks. Paper 2 is timely and practical, but is an incremental improvement within an already-crowded PTQ calibration space, likely yielding narrower, engineering-focused impact.

vs. MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

claude-opus-4.6 · 5/8/2026

Paper 2 introduces a fundamentally new foundation model class for tabular data, addressing a widely recognized gap in the AI ecosystem. Its breadth of impact is much larger—tabular data underpins finance, healthcare, science, and virtually every enterprise domain. The concept of native tabular understanding without preprocessing is a paradigm shift with potential to reshape how all downstream AI systems consume structured data. Paper 1, while valuable for radiology report generation, addresses a narrower application domain and builds incrementally on existing multi-agent and VLM paradigms rather than establishing a new model class.

vs. Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension

gpt-5.2 · 5/8/2026

Paper 2 has higher potential impact: it proposes a new foundation-model class for a ubiquitous, high-stakes modality (tabular data) with broad applicability across industries and ML subfields. If the claim of native, preprocessing-free table understanding holds, it could reshape pipelines for prediction, imputation, dataset characterization, and agentic analytics—far beyond a single benchmark. Paper 1 is timely and useful for multimodal multi-agent systems, but its contribution is a protocol/architecture extension with narrower scope and an explicit latency tradeoff; impact depends on adoption of a specific A2A ecosystem.