To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

Nevena Lazić, Liam Fowl, András György, Csaba Szepesvári

#127 of 2292 · Artificial Intelligence
Tournament Score: 1535±33
Win Rate: 71% (29 wins, 12 losses, 41 matches)
Rating: 6.8/10

Abstract

We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reasoning problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that were not observed during training, and it was shown that one reason behind this is the difficulty of copying (or generating) unseen tokens. We show both theoretically and empirically that a particular representational collapse also has a crucial role: the unembeddings (last-layer weights) of unseen tokens collapse to nearly the same vector during training. The collapse makes distinguishing multiple unseen variables difficult for the model (especially when the embedding and unembedding parameters are shared), and provides a mechanistic explanation for the effectiveness of existing heuristic interventions like "active forgetting", which periodically reset the token (un)embeddings. Based on these observations, we devise a combination of techniques, involving a small architecture change facilitating copying, data diversity, and freezing or resetting (un)embeddings, that achieves generalization to unseen tokens. We support our claims with extensive controlled experiments on propositional logic reasoning problems. Beyond synthetic experiments, we also observe evidence of (un)embedding collapse in the open-weight models in the Gemma 3 family, which includes 99 unused tokens reserved for downstream use. Empirically we find that the correlated embeddings of these tokens are a poor initialization for finetuning applications.

AI Impact Assessments

(3 models)

Scientific Impact Assessment

Core Contribution

This paper identifies and analyzes a previously overlooked mechanism — unembedding collapse — that explains why transformers fail to generalize symbolic reasoning to unseen tokens. Specifically, the authors demonstrate that during training, the unembedding vectors of tokens not present in the training data converge toward nearly identical vectors, making it impossible for the model to distinguish between multiple unseen symbols used as different variables. This provides a mechanistic explanation for previously observed failures and for the empirical success of heuristic interventions like "active forgetting" (Anand et al., 2025).

The paper's practical contribution is a combination of techniques — copy attention heads, high symbolic diversity in training data, and frozen/periodically reset embeddings — that together enable generalization to problems with entirely unseen token vocabularies. The authors also bridge from synthetic experiments to real models by demonstrating embedding collapse in the Gemma 3 family's 99 reserved unused tokens.

Methodological Rigor

Theoretical analysis. Lemma 4.1 is clean and well-constructed: it shows that under ℓ₂-regularized GD/SGD with bounded features (guaranteed by layernorm), the unembedding vectors of unseen labels contract toward each other at each step, provided the model assigns low probability to those labels on training data. The proof leverages the Lagrange mean-value theorem in a straightforward but effective way. The bound is tight enough to be informative — the contraction rate depends on the regularization strength λ, the feature norm bound r, and the maximum probability assigned to unseen classes. However, the analysis covers SGD with weight decay, while practice uses AdamW; the authors acknowledge this gap and provide empirical evidence that collapse occurs under AdamW as well.
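The contraction described for Lemma 4.1 can be illustrated numerically. The following is a minimal sketch, not the paper's experiment: it runs full-batch gradient descent with ℓ₂ regularization on a single softmax layer whose inputs are unit-norm random features, with labels drawn only from the first `n_seen` classes. All sizes, the learning rate, and the regularization strength are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_pairwise_distance(rows):
    # Average Euclidean distance over all unordered pairs of rows.
    diffs = rows[:, None, :] - rows[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(rows), k=1)
    return float(dists[iu].mean())

def simulate_collapse(steps=500, lr=0.5, weight_decay=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, d, vocab, n_seen = 256, 8, 10, 6   # assumed toy sizes
    H = rng.normal(size=(n, d))
    H /= np.linalg.norm(H, axis=1, keepdims=True)  # bounded features (norm 1)
    y = rng.integers(0, n_seen, size=n)   # labels never use classes >= n_seen
    Y = np.eye(vocab)[y]
    W = 0.1 * rng.normal(size=(vocab, d))  # unembedding matrix, one row per class
    before = mean_pairwise_distance(W[n_seen:])
    for _ in range(steps):
        P = softmax(H @ W.T)
        # Gradient of mean cross-entropy plus (weight_decay / 2) * ||W||^2.
        grad = (P - Y).T @ H / n + weight_decay * W
        W -= lr * grad
    after = mean_pairwise_distance(W[n_seen:])
    return before, after
```

Calling `simulate_collapse()` returns the mean pairwise distance among the unseen-class rows before and after training. In this toy setting the distance shrinks at every step, because both the weight-decay factor and the gradient term pull the unseen rows toward each other.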

Experimental design. The synthetic propositional logic testbed is well-controlled, enabling systematic ablation across multiple axes: vocabulary size (|P| ∈ {150, 600, 2400, 9600}), architecture (with/without copy attention), and embedding treatment (trainable/frozen/periodically reset). The evaluation distinguishes three regimes of increasing difficulty: all seen tokens, unseen query only, and all predicates unseen. This granularity is crucial for understanding failure modes.

A minor weakness is that the experimental evaluation is entirely on propositional logic problems with small models (~60M parameters). While the logic task is representative of symbolic reasoning, the generality of the findings to other reasoning domains remains unverified beyond the Gemma analysis.

Potential Impact

Mechanistic understanding. The identification of unembedding collapse as a systematic failure mode is valuable for the broader community studying transformer reasoning. It connects to and explains several previously disconnected observations: the transience of in-context learning, the effectiveness of active forgetting, and why copy mechanisms alone are insufficient with multiple unseen variables.

Practical implications for model design. The finding that Gemma 3's unused tokens have collapsed embeddings (with cosine similarities of 0.78 for the 1B model) and that finetuning with these tokens is ~10x slower than with well-separated tokens is directly actionable. Model developers who reserve tokens for downstream use should be aware that these tokens may require special initialization strategies.
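A diagnostic of the kind reported here can be sketched as a mean pairwise cosine similarity over a set of embedding rows. The example below is only an illustration on synthetic data: the matrices `well_separated` and `collapsed` are made up for the demonstration and are not Gemma 3 weights, and the function name is hypothetical.

```python
import numpy as np

def mean_pairwise_cosine(rows):
    # Normalize each row, then average the cosine over all unordered pairs.
    unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    sims = unit @ unit.T
    iu = np.triu_indices(len(rows), k=1)
    return float(sims[iu].mean())

rng = np.random.default_rng(0)
dim = 64  # assumed embedding width for the toy example

# Independent random rows: a stand-in for well-separated token embeddings.
well_separated = rng.normal(size=(20, dim))

# One shared vector plus small noise: a stand-in for collapsed embeddings.
shared = rng.normal(size=dim)
collapsed = shared + 0.05 * rng.normal(size=(20, dim))

print("well separated:", mean_pairwise_cosine(well_separated))
print("collapsed:     ", mean_pairwise_cosine(collapsed))
```

On a real checkpoint, the same function could be applied to the rows of the embedding matrix that correspond to the reserved token ids, once those ids are looked up in the tokenizer.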

Copy attention architecture. The proposed copy attention mechanism, while building on Boix-Adserà et al. (2024), is a practical and lightweight modification (separate query/key parameters for copying, bypassing MLPs) that shows slight improvements even on standard language modeling (C4 experiments). This could influence architecture design for tasks requiring faithful reproduction of input symbols.
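To make the idea of copying concrete, here is a generic pointer-style sketch, offered as an assumption about how a copy path can work rather than as the paper's architecture. A dedicated query/key pair scores the context positions, and the attention mass is placed directly on the token ids found in the context, so the output never passes through an MLP or the unembedding matrix. The function name and the weights `W_q` and `W_k` are hypothetical.

```python
import numpy as np

def copy_distribution(hidden, token_ids, W_q, W_k, vocab_size):
    """Distribution over the vocabulary obtained by copying context tokens.

    hidden:    (T, d) hidden states for the T context positions
    token_ids: (T,) integer ids of the tokens at those positions
    W_q, W_k:  (d, d_k) query/key projections used only by the copy path
    """
    q = hidden[-1] @ W_q                    # query from the last position
    k = hidden @ W_k                        # keys for every context position
    scores = k @ q / np.sqrt(q.shape[0])    # scaled dot-product scores, shape (T,)
    scores = scores - scores.max()          # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum()                      # attention weights over positions
    probs = np.zeros(vocab_size)
    np.add.at(probs, token_ids, attn)       # sum the weights of repeated ids
    return probs
```

Because the probability mass lands only on ids that appear in `token_ids`, a token that was never seen in training can still be emitted as long as it occurs in the context.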

Limitations for broader impact. The solutions proposed (freezing/resetting embeddings) are acknowledged as impractical for general-purpose models. Temporary active forgetting combined with copy attention is more promising but still requires careful tuning of hyperparameters (k₁, k₂). The paper does not demonstrate these techniques at scale.
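A temporary reset schedule of this general kind can be sketched as follows. The review does not define k₁ and k₂, so the sketch uses its own hypothetical parameters, `reset_every` and `stop_after`, and makes no claim to match the paper's schedule. The training step itself is omitted.

```python
import numpy as np

def should_reset(step, reset_every, stop_after):
    # Reset on every multiple of `reset_every`, but only up to `stop_after`.
    return 0 < step <= stop_after and step % reset_every == 0

def reset_embeddings(embeddings, rng, std=0.02):
    # Draw a fresh random matrix of the same shape (assumed Gaussian init).
    return rng.normal(scale=std, size=embeddings.shape)

def run_schedule(total_steps, reset_every, stop_after, seed=0):
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=0.02, size=(16, 4))  # toy (un)embedding table
    reset_steps = []
    for step in range(1, total_steps + 1):
        # A real training step would update E and the other weights here.
        if should_reset(step, reset_every, stop_after):
            E = reset_embeddings(E, rng)
            reset_steps.append(step)
    return reset_steps
```

For example, `run_schedule(10, reset_every=3, stop_after=7)` resets at steps 3 and 6 and then leaves the embeddings to train normally.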

Timeliness & Relevance

This work is highly timely. The question of whether LLMs perform genuine reasoning versus pattern matching is central to current AI research. The specific focus on abstract symbolic reasoning — where models must operate on structure rather than content — directly addresses concerns raised by recent work on token bias and fragility of LLM reasoning (Jiang et al., 2024; Mirzadeh et al., 2024). The connection to reserved/unused tokens in production models (Gemma 3) adds immediate practical relevance.

Strengths

1. Clear mechanistic insight: The unembedding collapse phenomenon is well-characterized both theoretically and empirically, providing a principled explanation for previously mysterious failures.

2. Comprehensive experimental design: The systematic ablation across architecture variants, embedding strategies, and diversity levels provides a thorough understanding of the interaction between these factors.

3. Theory-practice bridge: The Gemma 3 experiments demonstrate that the synthetic findings have real-world implications, strengthening the paper's relevance.

4. Explanatory power: The framework explains why existing interventions (active forgetting, copy mechanisms) work partially and why they fail in combination with multiple unseen variables.

5. Multi-token symbol analysis (Appendix F): The preliminary study on multi-token variables reveals additional failure modes (truncation, prefix confusion, hallucination loops) that open interesting future directions.

Limitations

1. Scale gap: The synthetic experiments use ~60M-parameter models. Whether the proposed solutions (especially copy attention + temporary active forgetting) scale to larger models remains unknown.

2. Task specificity: Only propositional logic is studied in depth. Generalization to other symbolic reasoning domains (e.g., algebra, program synthesis) is not demonstrated.

3. AdamW theory gap: The theoretical analysis covers SGD but not AdamW, the optimizer actually used. Adam's adaptive learning rates could either counteract or accelerate collapse.

4. Frozen embeddings impracticality: The most reliable solution (freezing embeddings) significantly degrades language modeling performance (Figure 10), limiting its applicability. The practical recommendation remains somewhat unclear.

5. Gemma finetuning experiment: While showing that unused tokens slow finetuning, the experiment doesn't demonstrate a fix. Simply reinitializing the collapsed embeddings before finetuning is the obvious intervention but isn't tested.
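The reinitialization mentioned in the last item could look roughly like the sketch below. It is untested on any real model, here or (per the review) in the paper: the chosen rows are replaced with random directions rescaled to the mean norm of the remaining rows. The function name and the norm-matching choice are assumptions made for illustration.

```python
import numpy as np

def reinitialize_rows(embeddings, row_ids, rng):
    """Return a copy of `embeddings` with the rows in `row_ids` redrawn."""
    E = embeddings.copy()
    keep = np.ones(len(E), dtype=bool)
    keep[row_ids] = False
    # Match the typical scale of the rows that are left untouched.
    target_norm = np.linalg.norm(E[keep], axis=1).mean()
    fresh = rng.normal(size=(len(row_ids), E.shape[1]))
    fresh *= target_norm / np.linalg.norm(fresh, axis=1, keepdims=True)
    E[row_ids] = fresh
    return E
```

Whether this recovers the finetuning speed of well-separated tokens is exactly the open question the review raises; the sketch only shows what the intervention would touch.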

Overall Assessment

This paper makes a solid mechanistic contribution to understanding transformer failures in symbolic reasoning. The unembedding collapse phenomenon is clearly identified, theoretically grounded (with caveats), and empirically validated. The connection to real models (Gemma 3) elevates the work beyond purely synthetic analysis. While the proposed solutions are not yet fully practical for large-scale models, the diagnostic insights are valuable and the experimental methodology is thorough.

Rating: 6.8/10
Significance 7 · Rigor 7 · Novelty 7 · Clarity 8

Generated Apr 24, 2026

Comparison History (41)

vs. Learning to Draw ASCII Improves Spatial Reasoning in Language Models
gpt-5.2 · 5/5/2026

Paper 2 offers a mechanistic explanation (unembedding collapse for unseen tokens) with both theoretical and empirical support, and proposes targeted interventions (architecture tweak, data diversity, freezing/resetting embeddings) that improve out-of-distribution generalization. Its findings are broadly relevant to transformer training, tokenization, parameter tying, and finetuning practice, with demonstrated relevance to real open-weight models (Gemma 3 reserved tokens). Paper 1 is novel and useful for spatial reasoning, but its impact is narrower and more task-specific, with less general mechanistic insight into transformer failures.

vs. Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement
gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact: it tackles a high-value real-world problem (industrial RTL timing/PPA optimization) with a realistic evaluation setup (commercial EDA workflow, 20 real designs) and reports substantial gains over a leading tool, suggesting immediate practical relevance and adoption potential. Its agentic closed-loop framework plus reusable skill library could generalize to broader EDA automation and tool-grounded self-improving agents. Paper 1 is methodologically strong and novel mechanistically, but is narrower (symbolic reasoning generalization and token embedding dynamics) with less direct near-term application breadth.

vs. Knows: Agent-Native Structured Research Representations
gemini-3 · 5/5/2026

Paper 2 addresses a fundamental limitation of transformer models (generalization in symbolic reasoning with unseen tokens) by providing both theoretical insights into representational collapse and practical architectural fixes. While Paper 1 offers a highly practical tool for agent workflows, Paper 2 has a broader foundational impact on how large language models are trained and architecture-designed, potentially influencing the next generation of reasoning models across all domains.

vs. WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
claude-opus-4.6 · 5/1/2026

Paper 1 provides deep theoretical and empirical insights into a fundamental limitation of transformers—representational collapse of unseen token embeddings—with mechanistic explanations and principled solutions. This has broad implications for understanding and improving transformer generalization, applicable beyond symbolic reasoning to any setting with rare/unseen tokens, including finetuning of large language models (as demonstrated with Gemma 3). Paper 2, while practically useful, is primarily a benchmark contribution for GUI agents—a narrower, more applied domain—and its findings (agents struggle with multi-app tasks) are somewhat expected, limiting its novelty and breadth of impact.

vs. WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
gpt-5.2 · 5/1/2026

Paper 2 offers a novel mechanistic explanation (unembedding/embedding collapse for unseen tokens) for a known generalization failure in transformers, supported by theory plus controlled experiments and validated on real open-weight models (Gemma 3). It proposes actionable, broadly applicable interventions (architecture tweak, data diversity, freezing/resetting embeddings) with implications for symbolic reasoning, in-context learning, tokenizer design, and finetuning practice. Paper 1 is a valuable benchmark for GUI agents with clear applications, but benchmark papers typically have narrower methodological novelty and impact unless they become a dominant standard; Paper 2’s insight is more likely to transfer across many ML settings.

vs. Sound Agentic Science Requires Adversarial Experiments
claude-opus-4.6 · 4/27/2026

Paper 2 identifies a concrete, mechanistic phenomenon (representational collapse of unseen token embeddings) with both theoretical and empirical grounding, proposes actionable solutions, and demonstrates real-world relevance in production models (Gemma 3). This has broad implications for transformer architecture design, fine-tuning practices, and understanding generalization limits. Paper 1 raises important methodological concerns about LLM agents in science but is primarily a position/opinion piece proposing a philosophical standard (falsification-first) without novel empirical contributions or concrete implementations, limiting its direct scientific impact.

vs. Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows
gemini-3 · 4/26/2026

Paper 2 addresses a fundamental mechanism in transformer architectures (representational collapse), offering both theoretical and empirical validation, which broadly impacts model design and fine-tuning. In contrast, Paper 1 focuses on a specific, highly applied middleware optimization for LLM agents and relies heavily on simulated token counts and projected outcomes rather than live, end-to-end empirical measurements, limiting its methodological rigor and broader scientific impact.

vs. Mind the Prompt: Self-adaptive Generation of Task Plan Explanations via LLMs
gpt-5.2 · 4/26/2026

Paper 2 offers a more fundamental, mechanistic contribution: identifying and analyzing (un)embedding representational collapse as a key barrier to transformer generalization in symbolic reasoning, with both theoretical framing and extensive controlled experiments plus evidence in real open-weight models (Gemma 3). Its proposed interventions (architecture tweak, data diversity, freezing/resetting embeddings) are broadly actionable for LLM training, interpretability, and downstream finetuning. Paper 1 is novel for adaptive prompt synthesis via POMDPs and has practical HCI/system integration value, but is more domain-specific and likely less broadly influential across ML theory and model development.

vs. Mind the Prompt: Self-adaptive Generation of Task Plan Explanations via LLMs
claude-opus-4.6 · 4/26/2026

Paper 2 addresses a fundamental limitation of transformer models—generalization to unseen tokens in symbolic reasoning—with both theoretical analysis and practical solutions. The identification of representational collapse in unembeddings provides mechanistic understanding applicable beyond the specific task, with evidence in production models (Gemma 3). This has broad implications for transfer learning, finetuning, and transformer architecture design. Paper 1, while addressing an interesting application of LLMs for explainability via adaptive prompt engineering, is more narrowly scoped as a proof-of-concept system with less generalizable theoretical contributions.

vs. HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference
gemini-3 · 4/26/2026

Paper 1 investigates the fundamental mechanisms of generalization and representational collapse in transformers during symbolic reasoning. It provides both theoretical and empirical insights into model behavior, which has broad implications for understanding and improving transformer architectures. While Paper 2 offers a highly practical and effective engineering solution for MLLM efficiency, Paper 1's contribution to the foundational scientific understanding of how LLMs learn and fail gives it a higher potential for broad scientific impact across deep learning research.

vs. Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
claude-opus-4.6 · 4/26/2026

Paper 2 addresses a fundamental limitation of transformer architectures—representational collapse of unseen token embeddings—with both theoretical analysis and practical solutions. This has broad implications across all transformer applications involving novel or rare tokens, including finetuning and domain adaptation. The mechanistic understanding it provides (validated in production models like Gemma 3) is more foundational than Paper 1's engineering-focused forecasting system. While Paper 1 achieves strong benchmark results, its contributions are more incremental and domain-specific, whereas Paper 2's insights about embedding collapse generalize across the field.

vs. Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
gemini-3 · 4/26/2026

Paper 1 addresses a critical and highly timely issue in AI safety—jailbreaking Vision-Language Models. By moving beyond surface-level pixel attacks to exploit deep semantic structures in natural images, it introduces a novel attack vector. The release of a massive dataset (MemJack-Bench) with over 113,000 trajectories provides a highly valuable resource that will likely catalyze significant future research in multimodal defensive alignment, leading to high citations and immediate real-world security applications.

vs. Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
gemini-3 · 4/26/2026

Paper 2 provides fundamental mechanistic insights into transformer generalization and representational collapse. By offering architectural solutions and directly addressing practical finetuning issues in state-of-the-art models like Gemma 3, it promises broader impact across the foundational AI community. Paper 1, while valuable, focuses on a more specific diagnostic probe for abstract visual reasoning benchmarks, which has narrower implications.

vs. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures
gpt-5.2 · 4/26/2026

Paper 2 likely has higher scientific impact due to a clearer, more general mechanistic finding about transformer training dynamics: (un)embedding representational collapse for unseen tokens, supported by theory, controlled experiments, and evidence in widely used open models (Gemma 3). This insight applies broadly across NLP/LLMs (generalization, tokenization, finetuning, weight tying) and offers actionable interventions with relevance to many downstream tasks. Paper 1 is valuable for embodied VLA robustness, but its techniques are more domain-specific and incremental within agent-correction/alignment literature, with narrower cross-field reach.

vs. Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning
claude-opus-4.6 · 4/26/2026

Paper 2 identifies a fundamental mechanistic issue (representational collapse of unseen token embeddings) in transformers that affects generalization broadly, provides both theoretical and empirical analysis, proposes actionable solutions, and validates findings in real open-weight models (Gemma 3). This has broader implications for transformer architecture design, fine-tuning practices, and understanding of generalization. Paper 1, while valuable for AI safety and formal verification, addresses a narrower question about LLM formalization gaming with primarily empirical/negative results, limiting its broader impact.

vs. Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction
gemini-3 · 4/26/2026

Paper 2 investigates fundamental mechanistic limitations of transformers in symbolic reasoning, offering both theoretical insights and empirical solutions to representational collapse. Its findings on token generalization have broad implications across the entire field of large language models. In contrast, Paper 1 is an applied systems-integration paper proposing a framework for tele-rehabilitation, which, while practically useful, lacks the theoretical depth and foundational breadth of impact of Paper 2. Additionally, Paper 1 describes an incomplete clinical evaluation.

vs. Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction
gemini-3 · 4/26/2026

Paper 1 addresses a fundamental limitation of transformer models (generalization in symbolic reasoning) by providing theoretical and empirical evidence of representational collapse. Because it targets core architectural and mechanistic understanding of LLMs, including relevance to state-of-the-art models like Gemma 3, its findings can broadly impact the foundation model research community. In contrast, Paper 2 presents an applied framework integrating existing technologies (LLMs, MediaPipe) for a specific medical use case, and only outlines a clinical evaluation plan without finalized empirical results, limiting its immediate scientific impact compared to foundational ML research.

vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
gemini-3 · 4/26/2026

Paper 2 addresses a critical, system-wide bottleneck in the scientific process (peer review) with a massive, real-world deployment on over 22,000 papers. Its demonstration that AI can assist or even outperform human reviews on technical accuracy has profound implications for how scientific research is evaluated across all disciplines. While Paper 1 offers valuable mechanistic insights for transformer architectures, Paper 2's successful large-scale field trial and immediate applicability to the broader scientific enterprise give it a substantially higher breadth of impact and real-world significance.

vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
gemini-3 · 4/26/2026

Paper 1 addresses a critical, real-world consequence of AI safety alignment: iatrogenic harm through information withholding in medical crises. Its pre-registered, physician-validated methodology exposes a major flaw in current LLM safety paradigms. This has profound, immediate implications across AI policy, healthcare, and alignment research, offering broader societal and scientific impact than Paper 2's narrower, though methodologically sound, focus on Transformer token embedding collapse.

vs. AI scientists produce results without reasoning scientifically
claude-opus-4.6 · 4/26/2026

Paper 1 addresses a critical and timely question about AI-driven scientific discovery, demonstrating through large-scale evaluation (25,000+ runs across 8 domains) that LLM-based scientific agents fail to exhibit genuine scientific reasoning despite producing correct-seeming outputs. This has broad implications for the rapidly growing field of autonomous AI science, AI safety, and science policy. Its finding that scaffold engineering cannot fix fundamental reasoning deficits, and that outcome-based evaluation is insufficient, will influence how the community develops and evaluates AI scientific agents. Paper 2, while technically rigorous and mechanistically insightful regarding transformer generalization on symbolic reasoning, addresses a narrower problem with more limited real-world impact.