Antonio Pelusi, Stefano Braghin, Alberto Trombetta
Large language models (LLMs) are increasingly used as conditional generators for structured data, relying on in-context learning (ICL) to adapt to new distributions without parameter updates. We investigate the limits of ICL for structured generation under distribution mismatch, using high-cardinality tabular data as a controlled test case, and identify a structural failure mode we term "categorical prior lock-in": the inability of ICL to update the model's prior over token distributions inherited from pre-training. Across two 7B-parameter open-weight models, ICL improves numerical fidelity with additional examples but exhibits a sharp ceiling on categorical distributions, failing to reproduce rare classes entirely. Parameter-efficient fine-tuning (LoRA) overcomes these limitations but introduces measurable memorization risk and, in some cases, destabilizes structured output generation, highlighting a fundamental trade-off between adaptability and privacy.
The paper identifies and names "categorical prior lock-in" — the phenomenon where LLMs using in-context learning (ICL) cannot override their pre-training token distributions when generating structured tabular data, particularly for high-cardinality categorical features. The key insight is that ICL improves numerical feature fidelity with more examples but hits a hard ceiling on categorical distributions, with rare categories never being reproduced. The paper contrasts this with LoRA fine-tuning, which overcomes the distributional limitation but introduces memorization risk and, in the case of Mistral-7B, catastrophic structural failures.
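To make the two LoRA-side risks mentioned above concrete, the following is a minimal sketch of how one might check generated output for structural failures and for verbatim memorization. It is not the paper's evaluation code: the function names `parse_rows` and `exact_match_rate`, the CSV serialization, and the toy rows are all assumptions made for illustration, and exact-match rate is only one simple proxy for memorization.

```python
import csv
import io


def parse_rows(text, n_columns):
    """Parse generated CSV text; return (valid_rows, n_malformed).

    A row counts as malformed when its field count differs from n_columns.
    """
    valid, malformed = [], 0
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) == n_columns:
            valid.append(tuple(cell.strip() for cell in row))
        else:
            malformed += 1
    return valid, malformed


def exact_match_rate(train_rows, generated_rows):
    """Fraction of generated rows that are verbatim copies of a training row."""
    if not generated_rows:
        return 0.0
    train_set = {tuple(r) for r in train_rows}
    hits = sum(1 for r in generated_rows if tuple(r) in train_set)
    return hits / len(generated_rows)


# Hypothetical toy data: two training rows and three generated lines,
# one of which is a verbatim copy and one of which is missing a field.
train = [("34", "nurse", "IT"), ("51", "clerk", "DE")]
generated_text = "34,nurse,IT\n29,clerk,FR\n51,clerk\n"

valid, malformed = parse_rows(generated_text, n_columns=3)
rate = exact_match_rate(train, valid)
print(f"malformed rows: {malformed}, exact-match rate: {rate:.2f}")
```

Counting malformed rows separately from memorized ones keeps the two failure types apart, which matters when a model both copies training rows and breaks the output format.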
The concept is intuitive and well-articulated: ICL operates as local conditioning while categorical distribution matching requires global probability reweighting across the full vocabulary. This is a useful conceptual framing, though the underlying mechanism — that autoregressive models rely on pre-training priors when context is insufficient to override them — is not entirely surprising to the community.
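The categorical ceiling described above can be made measurable with two simple statistics: the total variation distance between real and generated category frequencies, and the share of rare real categories that appear at least once in the generated sample. The sketch below is an illustration under stated assumptions, not the metric used in the paper: the helper names, the 5% rarity threshold, and the toy samples are hypothetical.

```python
from collections import Counter


def category_frequencies(values):
    """Relative frequency of each category in a list of labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def total_variation_distance(real, synthetic):
    """TVD between two categorical samples: 0 = identical, 1 = disjoint."""
    p = category_frequencies(real)
    q = category_frequencies(synthetic)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in support)


def rare_class_coverage(real, synthetic, rare_threshold=0.05):
    """Fraction of rare real categories that occur at least once in synthetic.

    A category is "rare" when its real frequency is below rare_threshold
    (an assumed cut-off chosen for illustration).
    """
    p = category_frequencies(real)
    rare = {c for c, f in p.items() if f < rare_threshold}
    if not rare:
        return 1.0
    return len(rare & set(synthetic)) / len(rare)


# Hypothetical toy samples: the generator over-produces the majority
# class "A" and never emits the rare classes "C" and "D".
real = ["A"] * 90 + ["B"] * 6 + ["C"] * 3 + ["D"] * 1
synthetic = ["A"] * 97 + ["B"] * 3

tvd = total_variation_distance(real, synthetic)
coverage = rare_class_coverage(real, synthetic)
print(f"TVD: {tvd:.2f}, rare-class coverage: {coverage:.2f}")
```

In this toy case the overall distance is small because the majority class dominates, while rare-class coverage is zero. That is why reporting coverage alongside an aggregate distance is useful for high-cardinality columns.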
The experimental design is reasonably systematic but has notable limitations:
The paper addresses a practical concern for organizations wanting to use LLMs for synthetic data generation under data residency constraints (motivating the 7B model focus). The finding that ICL cannot handle high-cardinality categoricals is useful for practitioners who might otherwise deploy ICL-based tabular generation pipelines without understanding this limitation.
However, the impact is somewhat circumscribed:
The concept of "categorical prior lock-in" could become useful terminology if adopted, but naming a known phenomenon does not constitute a major scientific contribution.
The paper is timely in the sense that LLM-based tabular data generation is an active area, and understanding ICL's limitations is practically important. The focus on 7B models for on-premise deployment reflects real organizational constraints. However, the rapid scaling of locally deployable models may shift the landscape quickly: what holds at 7B may not hold for the 14B or 32B models that are increasingly feasible to deploy locally.
The paper also arrives in a context where several works (GReaT, CLLM, EPIC, TabuLa) have already explored LLM-based tabular generation, which makes the space competitive and the specific diagnostic contribution somewhat incremental.
This is a competent empirical study that identifies a real limitation of ICL for structured generation and presents it clearly. The "categorical prior lock-in" framing is useful but represents an empirical observation of a somewhat expected phenomenon rather than a deep theoretical insight. The single-dataset evaluation and limited model/scale coverage constrain the generalizability of claims. The paper would benefit significantly from testing more datasets, more models (including larger ones), more ICL examples, and proposing concrete mitigations. As it stands, it makes a useful but incremental contribution to the understanding of LLM-based tabular data generation.
Generated Jun 11, 2026
Paper 2 likely has higher impact: it introduces a principled, general framework (Jeffrey guidance) that broadens diffusion-model control beyond standard guidance, with demonstrated gains (FID improvements) and applications to fairness constraints—highly timely and broadly relevant across generative modeling, controllable synthesis, and responsible AI. Paper 1 identifies an important failure mode of ICL for structured data and a privacy–adaptability trade-off, but its scope is narrower (tabular structured generation) and primarily diagnostic rather than enabling new capabilities, with evidence limited to two 7B models.
Paper 2 identifies a fundamental failure mode ('categorical prior lock-in') in LLM in-context learning for structured data, providing mechanistic understanding with broad implications for the growing field of LLM-based data generation. It addresses critical issues of adaptability vs. privacy trade-offs. Paper 1 proposes ART, an interesting but incremental PEFT variant that optimizes visual inputs for MLLMs. While creative, its practical advantages over LoRA are modest, and the 'art stylization' aspect is more aesthetic than scientifically impactful. Paper 2's diagnostic contribution is more likely to influence future research directions.
Paper 2 identifies and characterizes a general failure mode of in-context learning for structured data (“categorical prior lock-in”), with implications for any LLM-based conditional generation under distribution shift. This is novel, timely, and broadly impactful across ML, data synthesis, evaluation, privacy, and deployment, and it frames an important trade-off between adaptability and memorization risk. Paper 1 is practically useful for clinical survival prediction but is more incremental (adapting existing tabular foundation models with a known survival head) and its impact is narrower to survival/tabular transfer learning.
Paper 1 identifies a fundamental failure mode ('categorical prior lock-in') in LLM in-context learning for structured data, which has broad implications for the rapidly growing field of LLM-based data generation and reasoning. This finding is timely given the widespread adoption of LLMs and provides mechanistic insight into ICL limitations that affects multiple downstream applications. Paper 2 addresses a more niche problem (continual anomaly detection with heterogeneous schemas) with a solid but incremental methodological contribution. While practically useful, Paper 1's conceptual contribution about ICL limitations is likely to influence a larger research community and spark further investigation.
Paper 1 identifies a fundamental limitation of LLMs in in-context learning ('categorical prior lock-in'), which has broad implications across numerous domains relying on LLMs for structured data generation. While Paper 2 offers an innovative prompting technique for neural operators, its impact is largely confined to the specialized field of solving partial differential equations (PDEs). Thus, Paper 1 has a higher potential for widespread scientific impact due to the ubiquity of LLMs and tabular data.
Paper 1 challenges standard evaluation practices in a major field (relational learning) and introduces a new geometry-aware benchmarking framework. By exposing systematic biases in flat leaderboards and evaluating a wide range of models, it establishes a foundational protocol that could shift how graph and relational models are assessed. Paper 2 provides valuable insights into LLM failure modes for structured data, but Paper 1's methodological contribution has broader implications for future research rigor and model development across graph machine learning.
Paper 2 identifies a specific, well-characterized failure mode ('categorical prior lock-in') in LLMs that has broad implications for the rapidly growing field of LLM-based structured data generation. It provides actionable insights about fundamental limitations of in-context learning and the trade-offs with fine-tuning, relevant to practitioners across ML, data science, and privacy. Paper 1 introduces an interesting framework (ESBM) for interpretable policy learning, but its scope is narrower (Atari-style games), the approach is more incremental, and the validation setting is limited. Paper 2's findings are more timely and broadly applicable.
Paper 1 identifies a fundamental limitation of in-context learning ('categorical prior lock-in') in LLMs. Because LLMs are being adopted across virtually all scientific disciplines, diagnosing their foundational failure modes provides broad, cross-disciplinary impact. While Paper 2 offers a highly valuable applied framework for climate modeling, its contribution is primarily an engineering and scaling achievement using existing U-Net architectures. Paper 1's insight into LLM behavior is likely to spur wider algorithmic innovations and garner higher citation volume across the rapidly growing AI research community.
Paper 1 introduces a novel and rigorous theoretical framework for online convex optimization under realistic capacity constraints with delayed feedback—a previously unstudied but practically important setting. It provides formal regret guarantees, a new reduction technique, and characterizes fundamental tradeoffs between capacity and delay. Paper 2 identifies an interesting failure mode (categorical prior lock-in) in LLM in-context learning for structured data, but is more empirical and narrower in scope. Paper 1's theoretical contributions have broader applicability across online learning, optimization, and systems with resource constraints, giving it higher long-term impact potential.
Paper 1 likely has higher impact: it identifies a concrete, practically important failure mode of in-context learning for structured/tabular generation (“categorical prior lock-in”), with clear implications for deploying LLMs in data-centric applications. It provides controlled experiments, compares ICL to LoRA, and surfaces an actionable trade-off (adaptability vs privacy/memorization) that spans ML, data management, and responsible AI. Paper 2 is novel and timely in connecting decoding temperature to critical-phenomena-like behavior, but its real-world utility and causal/methodological grounding may be less immediate, making impact more uncertain.