Data Language Models: A New Foundation Model Class for Tabular Data
Eda Erol, Giuliano Pezzoli, Ozer Cem Kelahmet
Abstract
Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. Every approach to tabular AI today, from gradient-boosted trees to the latest tabular foundation models, requires a preprocessing pipeline before any model can consume the data. None of them understand tabular data as a modality. We introduce the Data Language Model (DLM), the missing foundation model for tabular data. A DLM understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values. It is the tabular data layer on which AI models, agents, and vertical AI applications can be built, eliminating the preprocessing pipelines that currently stand between raw data and every AI system that consumes it. We present Schema-1, the first DLM: a 140M parameter model trained on more than 2.3M synthetic and real-world tabular datasets. Schema-1 outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models we evaluate on established row-level prediction benchmarks. On missing value reconstruction it achieves lower reconstruction error than all classical statistical methods and frontier large language models on mean performance across conditions, establishing that structural understanding of a dataset's own distributional geometry is more useful for imputation than world knowledge encoded in language. It identifies the industry sector of any unseen dataset from raw cell values alone, reliably across any domain, a task no prior tabular model can perform. It is the native tabular understanding layer that has been missing from the AI stack.
AI Impact Assessments (1 model)
Scientific Impact Assessment: Data Language Models: A New Foundation Model Class for Tabular Data
1. Core Contribution
The paper introduces "Data Language Models" (DLMs) as a formally defined foundation model class for tabular data, characterized by three conditions: multi-signal native ingestion, dataset-level contextual inference (domain identification), and metadata-independent operation. The concrete instantiation, Schema-1, is a 140M-parameter model trained on 2.3M datasets. The most genuinely novel contribution is blind dataset sector classification: identifying the industry domain of an unlabeled, metadata-free dataset from distributional structure alone, across a taxonomy of 10,000 sectors. This task is new and conceptually interesting. The paper also reports strong results on row-level prediction (CC18), missing data robustness, imputation, column-agnostic prediction, and sequential fine-tuning retention.
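For readers unfamiliar with how imputation claims of this kind are usually checked, the following is a minimal sketch of a masked-cell reconstruction test against a classical baseline (column-mean imputation). It is not the paper's evaluation protocol, which this assessment does not have access to; the function names, the 20% masking rate, and the synthetic Gaussian column are all assumptions made for illustration.

```python
import math
import random

def mask_cells(column, fraction, rng):
    """Hide a random fraction of cells; return the masked column and the hidden indices."""
    n_hidden = max(1, int(len(column) * fraction))
    hidden = sorted(rng.sample(range(len(column)), n_hidden))
    masked = list(column)
    for i in hidden:
        masked[i] = None
    return masked, hidden

def mean_impute(masked):
    """Classical baseline: fill every missing cell with the mean of the observed cells."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]

def reconstruction_rmse(original, imputed, hidden):
    """Root-mean-square error computed only over the cells that were hidden."""
    squared = [(original[i] - imputed[i]) ** 2 for i in hidden]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical numeric column; a real test would use held-out benchmark tables.
rng = random.Random(0)
column = [rng.gauss(50.0, 10.0) for _ in range(200)]
masked, hidden = mask_cells(column, 0.2, rng)
imputed = mean_impute(masked)
rmse = reconstruction_rmse(column, imputed, hidden)
print(f"mean-imputation RMSE on hidden cells: {rmse:.3f}")
```

Any imputation method, learned or classical, can be substituted for `mean_impute` and scored on the same hidden cells, which is what makes comparisons across methods and missingness conditions meaningful.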
2. Methodological Rigor
This is where the paper has critical deficiencies. The model architecture is essentially undisclosed. We learn Schema-1 has "four input pathways," 140M parameters, and a "retention component" and "adaptive memory component," but there is no specification of the neural network architecture, loss functions, optimizer, training procedure, or any technical detail sufficient for reproduction. This is a fundamental gap for a scientific paper claiming to introduce a new model class.
Benchmark fairness concerns are substantial:
The sector classification evaluation, while novel, is entirely self-referential: the taxonomy of 10,000 sectors was designed by the authors, the synthetic training data was generated against this taxonomy, and external validation is impossible without access to the taxonomy or model.
3. Potential Impact
If the claims are reproducible, the practical impact could be significant for enterprise AI workflows. The vision of eliminating preprocessing pipelines for tabular data addresses a real engineering pain point. Domain identification from raw distributional signatures could enable automated data cataloging, privacy classification, and regulatory compliance. The framing of tabular data as a modality deserving native foundation models is reasonable and aligns with calls from van Breugel and van der Schaar [2024].
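To make the idea of a "raw distributional signature" concrete, here is a toy sketch that summarizes a headerless table with three per-column statistics and assigns the nearest of a few reference signatures. This is emphatically not Schema-1's method, which is undisclosed; the chosen statistics, the sector labels, and the centroid values are hypothetical and serve only to show that metadata-free cataloging is a well-posed task.

```python
import math

def _is_number(cell):
    """True if the raw cell parses as a float."""
    try:
        float(cell)
        return True
    except (TypeError, ValueError):
        return False

def column_signature(cells):
    """Summarize one column as (missing rate, numeric rate, distinct-value ratio)."""
    present = [c for c in cells if c not in ("", None)]
    missing_rate = 1 - len(present) / len(cells)
    if not present:
        return (missing_rate, 0.0, 0.0)
    numeric_rate = sum(_is_number(c) for c in present) / len(present)
    distinct_ratio = len(set(present)) / len(present)
    return (missing_rate, numeric_rate, distinct_ratio)

def dataset_signature(columns):
    """Average the per-column signatures; column names are never consulted."""
    sigs = [column_signature(col) for col in columns]
    return tuple(sum(s[k] for s in sigs) / len(sigs) for k in range(3))

def nearest_label(signature, centroids):
    """Return the label whose reference signature is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(signature, centroids[label]))

# Hypothetical headerless table (given as columns) and hypothetical reference signatures.
columns = [["12.5", "13.0", "", "11.2"], ["A", "B", "A", "A"]]
centroids = {"retail": (0.1, 0.5, 0.7), "sensor_logs": (0.0, 1.0, 0.9)}
signature = dataset_signature(columns)
print(signature, nearest_label(signature, centroids))
```

A learned model would replace both the hand-picked statistics and the nearest-centroid rule, but an external evaluation would still need the label taxonomy to be published, which is the gap noted above.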
However, the complete lack of reproducibility severely limits actual scientific impact. No code, no model weights, no architecture details, and no taxonomy specification are provided. Other researchers cannot build on, verify, or extend this work.
4. Timeliness & Relevance
The paper addresses a timely topic. Tabular foundation models are an active research area, and the limitations of existing approaches (cold-start for GBDTs, serialization loss for LLM-based methods, lack of domain understanding in TabPFN-style models) are well-documented. The positioning toward vertical AI and agentic systems reflects genuine industry trends. The formal definition of what distinguishes a DLM from prior approaches is a useful conceptual contribution, even if the three conditions appear tailored to exclude all competitors.
5. Strengths & Limitations
Strengths:
Limitations:
Additional Observations
The paper conflates multiple distinct capabilities (domain identification, prediction, imputation, continual learning) under a single model class definition, making it difficult to assess which capabilities are genuinely novel and which are incremental improvements achieved through scale or fine-tuning. The training corpus of 2.3M datasets is substantial, but without knowing which distributional patterns the synthetic data covers, it is impossible to assess whether the model generalizes or merely memorizes distributional templates. The absence of any open-source component makes this work primarily a commercial announcement wrapped in academic framing.
Generated May 8, 2026
Comparison History (19)
Paper 1 introduces a foundation model for tabular data, a ubiquitous modality in real-world applications. By eliminating preprocessing pipelines and outperforming existing methods like gradient-boosted trees, it offers substantial practical utility and broad impact across multiple industries. Paper 2, while offering rigorous theoretical contributions to AI safety and reinforcement learning, has a narrower scope and less immediate real-world applicability than a general-purpose model for tabular data analysis.
Paper 2 introduces a fundamentally new foundation model class (Data Language Models) for tabular data, which is arguably the most widely used data modality in industry yet lacks native foundation models. This addresses a significant gap in the AI stack with broad applications across virtually every domain that uses structured data. While Paper 1 makes a solid contribution by identifying longitudinal safety risks in memory-equipped LLM agents—an important and timely concern—its scope is narrower, focused on a specific failure mode. Paper 2's potential to reshape how tabular data is consumed across the entire AI ecosystem gives it broader impact potential.
Paper 1 proposes a new foundation-model class for tabular data that directly consumes raw tables without serialization/preprocessing, a potentially paradigm-shifting capability for a dominant real-world modality (finance, healthcare, ops). If validated, it could broadly replace/augment standard pipelines and impact many fields. It reports strong benchmark results and novel abilities (dataset sector identification, robust imputation). Paper 2 is timely and useful for agentic forecasting, but is more incremental—an improved harness/training protocol atop existing LLMs—with narrower applicability and potentially higher sensitivity to evaluation design.
Paper 2 introduces a fundamentally new foundation model class for tabular data, the most ubiquitous data modality in enterprise and scientific research. While Paper 1 addresses a highly timely issue (evaluating production AI agents), Paper 2's Data Language Model natively processes raw tables without preprocessing, claiming to outperform established methods like XGBoost and AutoML. This represents a potential paradigm shift in how structured data is handled across all of AI, offering broader, cross-disciplinary scientific impact and immense real-world utility compared to a domain-specific evaluation framework.
Paper 1 introduces a potentially new foundation-model class for tabular data with strong claimed performance across prediction, imputation, and dataset-level understanding directly from raw cells, addressing a ubiquitous, high-stakes modality in industry. If validated, eliminating preprocessing and providing native tabular representations could broadly impact ML systems, AutoML, data engineering, and vertical applications. Paper 2 is timely and useful for agent memory, but graph/RAG memory engines are a crowded space and likely more incremental; impact is narrower to LLM-agent workflows. Overall, Paper 1 appears more novel with wider cross-domain applicability.
Paper 2 targets a timely, high-stakes problem—LLM safety/alignment—where advances rapidly diffuse across academia and industry. Its central idea (training models to internally verify the safety of their own outputs) is a novel shift from behavioral compliance to intrinsic self-critique, with clear real-world applications (jailbreak robustness, safer deployment) and broad cross-domain relevance. The verification-based framework plus RL initialization claim suggests methodological rigor and extensibility. Paper 1 is innovative for tabular foundation models, but its impact may be narrower and hinges on strong evidence that “no preprocessing” generalizes across messy real-world schemas.
Paper 1 introduces a fundamentally new foundation model class for tabular data—one of the most widely used data modalities in real-world AI—addressing a long-standing gap in the AI stack. Its broad applicability across prediction, imputation, and dataset understanding, combined with strong empirical results surpassing gradient-boosted ensembles, AutoML, and frontier LLMs, gives it enormous potential impact across many fields. Paper 2 addresses an important but narrower AI safety concern (sabotage detection in ML research), with a benchmark of limited scale (9 codebases) and results showing current methods are inadequate, making it valuable but more niche in scope.
Paper 1 introduces a novel foundation model natively designed for tabular data, the most ubiquitous data modality in enterprise and scientific applications. By eliminating preprocessing and outperforming existing SOTA (like gradient-boosted trees and AutoML), it has massive potential for broad, cross-disciplinary real-world impact. While Paper 2 presents rigorous theoretical work on user simulation for conversational agents, its scope is much narrower compared to the universal relevance of a breakthrough in tabular data modeling.
Paper 1 introduces a novel foundation model architecture natively designed for tabular data, the most ubiquitous data modality in enterprise and scientific applications. By eliminating preprocessing pipelines and demonstrating state-of-the-art performance across diverse tasks, it offers broad, cross-disciplinary utility. In contrast, Paper 2 presents a domain-specific conceptual framework for clinical AI. While valuable for medical AI regulation and evaluation, Paper 1's concrete technological advancement and vast potential for real-world application across nearly all data-driven fields give it significantly higher potential scientific impact.
Paper 1 introduces a fundamentally new foundation model class for tabular data—the most common data modality in enterprise AI—addressing a major gap in the foundation model ecosystem. Its breadth of impact is enormous: it eliminates preprocessing pipelines, outperforms established methods (GBDTs, AutoML, tabular FMs, LLMs) across multiple tasks, and enables novel capabilities like dataset-level understanding. Paper 2 makes a solid but narrower theoretical contribution to statistical efficiency of Shapley value estimation. While rigorous, its scope is limited to a specific estimation problem, whereas Paper 1 could reshape how all tabular AI systems are built.
Paper 1 has higher potential impact due to its broad, modality-level framing (a native foundation model for tabular data) and strong real-world applicability across virtually all data-driven domains where tables dominate. If validated, eliminating preprocessing/serialization would be a substantial shift in tabular ML practice and could enable new agentic/vertical AI stacks. Its claims span multiple tasks (prediction, imputation, dataset characterization), suggesting wide downstream leverage. Paper 2 is methodologically rigorous and timely for XAI/data valuation, but it advances a narrower estimation subproblem with more limited cross-field disruption.
Paper 1 introduces a fundamentally new foundation model class for tabular data—a major underserved modality in AI. Its contributions are broad: native tabular understanding without preprocessing, strong benchmarks against GBMs/AutoML/LLMs, and novel capabilities like dataset-level industry identification. The 140M parameter model trained on 2.3M datasets demonstrates significant methodological rigor and addresses a foundational gap in the AI stack. Paper 2 presents an incremental framework for evolving LLM agent skills, which is useful but narrower in scope, building on existing agent paradigms rather than establishing a new model class.
Paper 1 introduces a native foundation model for tabular data, a ubiquitous and critical modality across nearly all scientific and enterprise domains. By eliminating preprocessing pipelines and outperforming established methods like GBDTs, it offers a fundamental breakthrough in representation learning. While Paper 2 presents a valuable framework for agentic skill evolution, Paper 1 addresses a more foundational and universally applicable problem, promising significantly broader real-world applications and transformative scientific impact across multiple disciplines.
Paper 1 likely has higher scientific impact: it proposes a new model class (Data Language Models) that natively consumes raw tabular data, claims broad performance gains over strong baselines, and targets a core modality used across high-stakes domains, enabling wide downstream applications (prediction, imputation, dataset understanding). If validated, it could shift tabular ML foundations and infrastructure. Paper 2 is timely and useful as a benchmark for LLM economic behavior, but benchmarks typically have narrower, more incremental impact than a potentially paradigm-shifting modeling approach.
Paper 1 introduces a novel foundation model for tabular data, arguably the most ubiquitous data modality in enterprise applications. By eliminating preprocessing and outperforming traditional gradient-boosted trees, it represents a potential paradigm shift in data science. Paper 2, while offering a valuable economic benchmark for LLMs, has a narrower scope and focuses on evaluation rather than introducing a fundamental, broadly applicable AI architecture.
Paper 2 likely has higher impact: proposing a native “foundation model” for tabular data is a broad, modality-level shift with wide applicability across industry (finance, healthcare, operations) and ML systems, potentially reducing reliance on brittle preprocessing and enabling new agents/workflows. If validated, it could influence benchmarks, tooling, and downstream research across databases, AutoML, and representation learning. Paper 1 is timely and rigorous for LLM auditing/provenance, but its impact is more narrowly scoped to attribution/retrieval and evaluation, whereas DLMs could reshape a larger swath of ML practice.
Paper 1 is more novel and potentially foundational: it proposes a new model class (native tabular “Data Language Models”) and demonstrates broad capabilities (prediction, imputation, dataset identification) that could reshape how tabular ML systems are built, with high real-world applicability across many industries relying on tables. Its impact could span ML, data management, and AI product stacks. Paper 2 is timely and practical, but is an incremental improvement within an already-crowded PTQ calibration space, likely yielding narrower, engineering-focused impact.
Paper 2 introduces a fundamentally new foundation model class for tabular data, addressing a widely recognized gap in the AI ecosystem. Its breadth of impact is much larger—tabular data underpins finance, healthcare, science, and virtually every enterprise domain. The concept of native tabular understanding without preprocessing is a paradigm shift with potential to reshape how all downstream AI systems consume structured data. Paper 1, while valuable for radiology report generation, addresses a narrower application domain and builds incrementally on existing multi-agent and VLM paradigms rather than establishing a new model class.
Paper 2 has higher potential impact: it proposes a new foundation-model class for a ubiquitous, high-stakes modality (tabular data) with broad applicability across industries and ML subfields. If the claim of native, preprocessing-free table understanding holds, it could reshape pipelines for prediction, imputation, dataset characterization, and agentic analytics—far beyond a single benchmark. Paper 1 is timely and useful for multimodal multi-agent systems, but its contribution is a protocol/architecture extension with narrower scope and an explicit latency tradeoff; impact depends on adoption of a specific A2A ecosystem.