Pietro Barbiero, Giovanni De Felice, Mateo Espinosa Zarlenga, Francesco Giannini, Filippo Bonchi, Mateja Jamnik, Giuseppe Marra, Ruggero Noris
As Artificial Intelligence models grow in complexity, interpretability has become an indispensable tool for understanding, debugging, and controlling their computations. However, interpretability lacks general theories to deductively design interpretable methods. This gap between theories and methods results in a fragmented literature and inconsistent evaluation protocols. To fill this gap, we introduce the Standard Interpretable Model (SIM), a general theory grounded in Lagrangian mechanics that enables the deductive design of interpretable methods. Specifically, the SIM summarises, in a set of premises, what interpretability is for a target user. From these premises, the SIM systematically derives interpretability symmetries and corresponding constraints, which shape the landscape of a Lagrangian whose minima correspond to optimal interpretable models. To reach the minima, one can either update the parameter values of an opaque model to make it more interpretable or compile constraints into an interpretable architecture. We empirically show that the SIM identifies and solves limitations of existing methods (including traditional, concept-based, and mechanistic interpretability), highlights underexplored research directions, and informs the design of core programming interfaces. Beyond being a research method, the deductive nature of the SIM offers pedagogical grounding for interpretability curricula and may shift the scientific community's perspective of a discipline that has long been fragmented.
The paper proposes the Standard Interpretable Model (SIM), a framework that formalizes interpretability through the lens of Lagrangian mechanics and symmetry-based reasoning. The central idea is a six-step pipeline: (1) state interpretability premises relative to a target user, (2) derive symmetries from these premises, (3) translate symmetries into measurable constraints, (4) construct a Lagrangian whose minima correspond to optimal interpretable models, (5) derive parameter update rules via the principle of least action, and (6) compile constraints directly into model architectures.
The paper instantiates this framework for "bounded and formal entities" — users with a fixed vocabulary, semantic mappings, and bounded computation — yielding three specific symmetries: concept invariance under monotonic maps, model invariance under concept projection, and hypothesis invariance under composition transformation. These map onto concrete constraints, architectures, and loss functions that subsume several existing interpretable model families (CBMs, prototypical networks, NAMs, SAEs, decision trees) as special cases.
Strengths in formalization: The symmetry-to-constraint derivations are mathematically precise, with formal lemmas and proofs provided for each constraint (Appendices A-G). The use of preorder-preserving maps for concept semantics (Symmetry I) is elegant and addresses a genuine subtlety — that interpretability requires preserving ordinal relationships rather than exact numerical values.
Concerns about the Lagrangian framing: While the language of Lagrangian mechanics provides a unifying notation, its necessity is debatable. The kinetic term T is essentially a momentum regularizer on parameter updates, and the potential V is a standard constrained optimization objective. The resulting update rule (gradient descent with momentum) is well known and can be derived without Lagrangian mechanics. The framework adds notational overhead that may obscure more than it illuminates for practitioners. The analogy to physics is suggestive but does not yield genuinely new optimization insights. In physics, by contrast, symmetries lead to conservation laws via Noether's theorem, which the paper references but does not meaningfully exploit.
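To make this concern concrete, gradient descent with momentum on a penalized objective can be written in a few lines with no reference to mechanics. The following is a minimal sketch, not the paper's implementation: the quadratic task loss, the sum-to-one constraint, the penalty weight `lam`, and all hyperparameters are hypothetical choices made only for illustration.

```python
import numpy as np

def potential_grad(theta, target, lam):
    # Gradient of the (hypothetical) potential
    #   V(theta) = 0.5 * ||theta - target||^2 + 0.5 * lam * (sum(theta) - 1)^2
    # i.e. a task loss plus a penalty for violating the constraint sum(theta) = 1.
    task_grad = theta - target
    constraint_grad = (theta.sum() - 1.0) * np.ones_like(theta)
    return task_grad + lam * constraint_grad

def heavy_ball(theta0, target, lam=10.0, lr=0.05, beta=0.9, steps=500):
    # Classical heavy-ball momentum: the velocity accumulates past gradients.
    theta = theta0.astype(float).copy()
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        velocity = beta * velocity - lr * potential_grad(theta, target, lam)
        theta = theta + velocity
    return theta

target = np.array([1.0, 2.0, 3.0])
theta_star = heavy_ball(np.zeros(3), target)
print(theta_star, theta_star.sum())
```

In this toy setting the penalty pulls the sum of the parameters toward 1 without enforcing it exactly, which is the usual trade-off of soft, optimization-based constraint enforcement.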
Empirical validation: The controlled experiments (Figures 5-8) effectively demonstrate the distinction between architectural compilation and optimization-based enforcement of symmetries. The finding that MAE is a poor proxy for concept semantics preservation (Figure 5) is well-illustrated. However, the experiments are limited in scale and scope — the controlled settings use toy data, and the large-scale experiments are observational rather than interventional.
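The gap between MAE and order preservation is easy to reproduce on synthetic numbers. The sketch below is illustrative only and is not the paper's Figure 5 experiment: the concept scores, the cubic map, and the `order_agreement` helper are assumptions introduced here. It contrasts a strictly increasing map, which changes values substantially but keeps every pairwise ordering, with a small perturbation that has a lower MAE yet swaps neighbouring scores.

```python
import numpy as np

def mae(a, b):
    # Mean absolute error between two score vectors.
    return float(np.mean(np.abs(a - b)))

def order_agreement(a, b):
    # Fraction of index pairs (i, j), i < j, that are ordered the same way in a and b.
    i, j = np.triu_indices(len(a), k=1)
    return float(np.mean(np.sign(a[i] - a[j]) == np.sign(b[i] - b[j])))

# Hypothetical ground-truth concept scores in (0, 1).
concepts = np.linspace(0.05, 0.95, 10)

# Strictly increasing map: values move a lot, ordering is untouched.
monotone = concepts ** 3

# Small perturbation: each adjacent pair of scores trades places.
swapped = concepts.copy()
swapped[0::2], swapped[1::2] = concepts[1::2], concepts[0::2]

for name, scores in [("monotone", monotone), ("swapped", swapped)]:
    print(name, "MAE:", mae(concepts, scores),
          "order agreement:", order_agreement(concepts, scores))
```

Under these assumptions the monotone map has the larger MAE but perfect order agreement, while the swapped scores have the smaller MAE and imperfect agreement, so ranking the two candidates by MAE alone would prefer the one that breaks the ordinal structure.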
Large-scale analysis: The VLM concept semantics experiment (Figure 9) is insightful but narrow: it tests only one concept ("red") on MNIST-derived images. The Steerling analysis (Figure 11) provides a useful empirical characterization but relies on approximations (effective rank, a proxy for f). The chain-of-thought experiment makes a strong claim (CoT is not interpretable under Symmetry II) on the basis of a single model and a limited evaluation.
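For readers unfamiliar with effective rank, one common entropy-based definition is sketched below. Whether the paper uses this exact definition is not stated here, so the function should be read as an assumption rather than as the authors' metric; the test matrices are likewise hypothetical.

```python
import numpy as np

def effective_rank(matrix, eps=1e-12):
    # Entropy-based effective rank: exp of the Shannon entropy of the
    # singular values after normalizing them to sum to one.
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero singular values
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
full = np.eye(4)                                         # four equal singular values
low = np.outer(rng.normal(size=4), rng.normal(size=4))   # rank-one matrix

print(effective_rank(full), effective_rank(low))
```

Unlike the integer matrix rank, this quantity varies continuously with the singular value spectrum, which makes it convenient as a summary statistic but also means that conclusions drawn from it depend on the chosen definition.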
Unifying framework: The paper's most valuable contribution is arguably Table 6, which systematically categorizes existing interpretable methods according to which symmetries they satisfy, revealing gaps (e.g., SAEs lacking semantic grounding, feature attribution lacking bounded reasoning constraints). This comparative analysis is actionable and could guide future method development.
Software contribution: The PyTorch Concepts library provides a concrete implementation path, though its impact depends on adoption and the maturity of the codebase.
Pedagogical value: The claim that SIM provides pedagogical grounding for interpretability curricula is plausible — the progression from premises to symmetries to constraints to architectures is didactically clean.
The paper addresses a genuine need. Interpretability research is indeed fragmented, with methods often motivated by intuition rather than derived from first principles. The timing is appropriate given the scaling of AI systems and regulatory pressure (EU AI Act). The analysis of large-scale concept-based models (Steerling) and VLMs addresses current practical concerns.
Missing elements: No user studies validate that models satisfying these symmetries are actually more interpretable to humans. The paper's epistemic falsifiability criterion acknowledges this but defers it entirely. The relationship to causal interpretability frameworks is underdeveloped.
This is an ambitious unifying paper that provides genuine organizational value to the interpretability field. Its strongest contribution is the systematic derivation pipeline and the comparative analysis it enables. However, the Lagrangian mechanics framing adds more notational complexity than genuine insight, and the empirical validation does not match the theoretical ambition. The framework is more useful as a taxonomic and analytical tool than as a practical method for building new interpretable systems.
Generated Jun 11, 2026
Paper 1 introduces a unifying theoretical framework (SIM) for interpretable ML grounded in Lagrangian mechanics, addressing a fundamental gap in a critically important and rapidly growing field. Its breadth of impact is enormous—it spans traditional, concept-based, and mechanistic interpretability, offers pedagogical value, and provides a deductive methodology applicable across the entire interpretability discipline. Paper 2, while technically rigorous with novel certified horizon guarantees for equivariant world models, addresses a narrower problem (multi-step prediction trust in equivariant models). Paper 1's potential to restructure and unify a fragmented field gives it broader and deeper long-term scientific impact.
Paper 2 proposes a general unifying theory (SIM) for interpretable machine learning grounded in Lagrangian mechanics, which addresses a fundamental gap across the entire interpretability field. Its breadth of impact spans traditional, concept-based, and mechanistic interpretability, offering both theoretical foundations and practical design principles. While Paper 1 makes solid contributions to latent reasoning with RL and mechanistic analysis, it addresses a narrower problem. Paper 2's potential to unify a fragmented discipline, inform curricula, and reshape how interpretability methods are designed gives it broader and longer-lasting scientific impact.
Paper 2 is likely to have higher impact due to its timeliness and direct applicability to a central, fast-moving problem: how RL post-training improves reasoning in LLMs. It provides concrete, empirically grounded mechanisms (strategy selection/improvement) and actionable training interventions, making it readily usable by both academia and industry. Paper 1 is ambitious and potentially unifying, but its Lagrangian-based general theory may face higher barriers to validation, adoption, and standardization, making near-term impact less certain despite high conceptual novelty.
Paper 2 proposes a unifying general theory for interpretable ML grounded in Lagrangian mechanics, addressing a fundamental gap in a rapidly growing field. Its breadth of impact is exceptional—spanning traditional ML, concept-based, and mechanistic interpretability—while offering both theoretical foundations and practical design principles. It has potential to reshape how the community approaches interpretability research and education. Paper 1, while technically strong with theoretical guarantees for latent dynamics recovery, addresses a more specialized problem. Paper 2's unifying framework across a fragmented discipline gives it broader and potentially more transformative impact.
Paper 1 introduces a general theoretical framework (SIM) for interpretable ML grounded in Lagrangian mechanics, addressing a fundamental gap between theory and practice in AI interpretability. Its breadth of impact is substantial—it unifies fragmented subfields (concept-based, mechanistic, traditional interpretability), provides deductive design principles, and offers pedagogical foundations. The timeliness is high given growing regulatory and scientific demand for interpretable AI. Paper 2 presents a solid but more incremental contribution applying INRs to policy representation learning—a narrower problem with fewer cross-cutting implications for the broader ML community.
Paper 2 proposes a unifying mathematical framework for AI interpretability, a critical and rapidly growing field. By grounding interpretability in Lagrangian mechanics, it offers a fundamental paradigm shift with broad theoretical and practical implications across all of machine learning. Paper 1, while practically valuable, presents a more specialized architectural improvement for point cloud-based robotic manipulation, giving it a narrower scope of impact.
Paper 2 likely has higher near-term scientific impact: it identifies a concrete, broadly relevant inference pathology in Transformer decoders (over-reliance on priors vs. input evidence) and proposes a training-free, plug-and-play fix with strong empirical gains on standard proteomics benchmarks and minimal overhead—high application value and methodological rigor. Paper 1 is ambitious and potentially transformative, but its impact depends on community adoption and validation of a broad theoretical framework; such general theories often face slower uptake and harder empirical falsification.
Paper 1 introduces a comprehensive general theory (SIM) for interpretable ML grounded in Lagrangian mechanics, addressing a fundamental gap in a critical field. It provides a unifying framework that connects fragmented approaches (traditional, concept-based, mechanistic interpretability), offers deductive design principles, and demonstrates empirical utility. Its breadth of impact spans AI safety, education, and method design. Paper 2 is a useful discussion/position paper on uncertainty in dynamical systems but is more incremental—clarifying existing concepts rather than introducing a novel theoretical framework with broad applicability.
Paper 2 demonstrates higher potential scientific impact due to its breadth and timeliness. While Paper 1 introduces an innovative architecture for sequential data using chess, Paper 2 proposes a unifying, general theory for interpretable ML, a highly critical and fragmented area in AI. By grounding interpretability in Lagrangian mechanics, Paper 2 provides a foundational framework applicable to traditional, concept-based, and mechanistic interpretability. This theoretical contribution has far-reaching implications for model safety, debugging, and cross-domain AI research, offering significantly broader impact than the domain-specific representation learning results of Paper 1.
Paper 2 proposes a foundational, unifying theory for machine learning interpretability using Lagrangian mechanics. By addressing a critical bottleneck in AI adoption and offering a deductive framework that spans multiple interpretability subfields, it has a vastly broader potential impact than Paper 1, which, while methodologically rigorous, focuses on improving computational efficiency for a specific mathematical problem (Optimal Transport).