RULER: Representation-Level Verification of Machine Unlearning
Georgina Cosma, Axel Finke
Abstract
Machine unlearning aims to remove the influence of specific training records from a deployed model without retraining from scratch. Current protocols verify this at the output level through membership inference, retain accuracy, and forget-set accuracy, but a model can satisfy all three whilst still encoding forgotten records in its intermediate representations. We introduce RULER, a set of representation-level verification metrics. The oracle-comparative metric M2 measures whether forget-set records occupy the same representational position as in a model retrained without them. The oracle-free metric M4 detects residuals from the unlearned model's internal similarity structure alone, without retraining. Four approximate unlearning methods all pass output-level evaluation, yet under a linear mixed-effects model M2 detects significant residuals in 10 of 12 conditions (p<0.05), with effect sizes growing as the forget fraction increases. A fifth method, Bad Teacher, shows the same residuals despite a different forgetting mechanism. M4 acts as a pre-unlearning diagnostic across tabular, image, clinical text, and face-identity settings: it detects identity-level memorisation in face recognition models where no tested method fully erases the signal.
AI Impact Assessments
Scientific Impact Assessment of RULER: Representation-Level Verification of Machine Unlearning
1. Core Contribution
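RULER identifies and addresses a concrete gap in machine unlearning evaluation: current protocols (MIA accuracy, retain/forget accuracy) operate only at the output level, but models can pass all three while still encoding forgotten records in intermediate representations. The paper introduces two complementary metrics: the oracle-comparative M₂, which measures whether forget-set records occupy the same representational position as in a model retrained without them, and the oracle-free M₄, which detects residuals from the unlearned model's internal similarity structure alone, without retraining.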
The key empirical finding is a *discordance*: four approximate unlearning methods (Gradient Ascent, NegGrad+, Fine-Tuning, SCRUB) all pass output-level evaluation yet show statistically significant representation-level residuals under M₂ in 10 of 12 conditions (p<0.05), and a fifth method, Bad Teacher, shows the same residuals despite a different forgetting mechanism. This is a meaningful contribution because it challenges the assumption that passing standard evaluation protocols implies successful unlearning.
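For readers unfamiliar with the analysis named in the abstract, a generic random-intercept linear mixed-effects model can be written as below. This is a sketch under stated assumptions, not the paper's reported specification: the grouping of observations by training seed, and the symbols $g_{s,c}$ (the M₂ gap for seed $s$ under condition $c$), $\beta_c$, and $u_s$, are introduced here purely for illustration.

```latex
g_{s,c} = \beta_0 + \beta_c + u_s + \varepsilon_{s,c},
\qquad u_s \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{s,c} \sim \mathcal{N}(0, \sigma^2)
```

Under this assumed form, a significant residual in a condition corresponds to rejecting the hypothesis that the fixed-effect mean $\beta_0 + \beta_c$ is zero, while the random intercept $u_s$ absorbs seed-to-seed variation.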
2. Methodological Rigor
The experimental design is commendably thorough.
However, several limitations merit attention. The primary experiments use a simple two-layer MLP on tabular data, which limits generalizability claims. The diagnostic experiments on images, text, and faces use only 5 seeds, which limits their statistical power. The paired-seed requirement for M₂ is a practical constraint: the metric is unavailable when the original model's initialization seed is unknown. The paper also uses cosine similarity at a single layer (the penultimate one), and it is unclear whether residuals at other layers would tell a different story.
The M₂ effect sizes, while statistically significant, are extremely small in absolute terms (gaps of ~0.001–0.005 in cosine similarity). The practical significance of these residuals — whether they could be exploited by an adversary or constitute meaningful information leakage — remains unaddressed.
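To make the scale of such gaps concrete, the following is a minimal sketch of an oracle-comparative cosine gap. It is not the paper's definition of M₂: the function names (`cosine`, `mean_alignment`, `oracle_gap`), the use of a control set as baseline, and the toy vectors are all hypothetical, and the sketch assumes the unlearned model and the retrained oracle share an initialization seed so that their penultimate-layer coordinates are comparable.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_alignment(reps_a, reps_b):
    """Mean per-record cosine similarity between two models' representations."""
    sims = [cosine(a, b) for a, b in zip(reps_a, reps_b)]
    return sum(sims) / len(sims)


def oracle_gap(unlearned, oracle, forget_idx, control_idx):
    """Hypothetical residual score (an assumption, not the paper's M2).

    Returns control-set alignment minus forget-set alignment. A positive
    value means forget-set records sit further from their oracle position
    than control records do.
    """
    forget = mean_alignment([unlearned[i] for i in forget_idx],
                            [oracle[i] for i in forget_idx])
    control = mean_alignment([unlearned[i] for i in control_idx],
                             [oracle[i] for i in control_idx])
    return control - forget


# Toy penultimate-layer representations (made up for illustration only).
oracle_reps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
unlearned_reps = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, -1.0]]

# Record 2 is the forget record; records 0, 1, 3 serve as controls.
gap = oracle_gap(unlearned_reps, oracle_reps, forget_idx=[2], control_idx=[0, 1, 3])
print(f"oracle-comparative gap: {gap:.4f}")
```

The toy gap here is far larger than the ~0.001–0.005 values discussed above. Differences that small would only be distinguishable from noise by aggregating over many records and seeds.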
3. Potential Impact
For unlearning evaluation: RULER provides a concrete, implementable toolkit that could become part of standard evaluation protocols. The oracle-free M₄ is particularly practical as a pre-unlearning diagnostic, since it requires no retraining. The finding that identity-level memorization in face recognition (LFW, M₄ = 0.73–0.94) persists after unlearning is directly relevant to GDPR compliance scenarios.
For method development: The consistent finding that all tested methods leave representation-level residuals suggests that future unlearning objectives should incorporate representation-level constraints. This could redirect algorithm design toward explicitly penalizing intermediate-layer memorization.
For regulation: While the authors appropriately disclaim that RULER is not a legal compliance test, the metrics could inform regulatory frameworks by providing a more stringent standard than output-level evaluation alone.
Limitations on impact: The paper does not demonstrate that these residuals lead to practical privacy violations (e.g., reconstruction attacks exploiting intermediate representations). Without this connection, the practical urgency of the finding is somewhat uncertain.
4. Timeliness & Relevance
Machine unlearning is increasingly relevant due to GDPR's right to erasure and growing deployment of ML systems. Recent work by Hayes et al. and Goel et al. has shown that standard unlearning evaluations can be misleading, making this a timely contribution. The shift from output-level to representation-level verification fills a recognized gap in the literature. The extension to face-identity and clinical text settings directly addresses high-stakes deployment scenarios.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional observations: The finding that M₂ reverses sign under mini-batch training (Table 6) is concerning, as it suggests the paired-seed calibration is fragile. This substantially limits M₂'s applicability to realistic large-scale settings. M₄ is more robust in this regard but lacks the statistical power for population-level inference.
Overall, RULER makes a solid conceptual and methodological contribution to machine unlearning evaluation. The identification of the output-representation discordance is valuable, and the metrics are well-formulated. However, the practical significance of the detected residuals and the scalability limitations of M₂ temper the immediate impact.
Generated May 28, 2026
Comparison History (24)
RULER addresses a fundamental gap in machine unlearning verification by revealing that models can pass all existing output-level tests while retaining forgotten data in intermediate representations. This has broad implications for privacy, regulatory compliance (GDPR right-to-be-forgotten), and trustworthy AI. The methodology is rigorous (linear mixed-effects models, multiple domains), introduces both oracle-dependent and oracle-free metrics, and exposes a critical blind spot in current evaluation protocols. DenoiseRL offers incremental improvements to RL-based reasoning training, but RULER opens an entirely new evaluation dimension with immediate practical and regulatory relevance across multiple fields.
Paper 1 addresses a foundational issue in machine unlearning by exposing the inadequacy of output-level verification and proposing representation-level metrics. Given the increasing regulatory pressure for data privacy and copyright compliance in AI, robust machine unlearning is a critical, highly timely challenge. Its implications span security, privacy, and core ML. Paper 2, while methodologically sound and useful for financial forecasting, presents a domain-specific application of multimodal fusion, which is likely to have a narrower scientific impact compared to the foundational privacy and evaluation mechanisms proposed in Paper 1.
Paper 2 has higher impact potential due to greater novelty (shifting unlearning verification from outputs to internal representations) and broad applicability across ML domains where privacy, compliance, and safety are critical. Its oracle-free metric (M4) is especially practical, enabling diagnostics without costly retraining, and the finding that many methods “pass” standard tests yet retain residuals could reshape evaluation standards. Methodological rigor is stronger (statistical testing across multiple modalities). Paper 1 is timely and useful for LLM reliability, but is a more incremental systems improvement on an established task.
Paper 2 (RULER) has higher impact potential due to its novel shift from output-level to representation-level verification for machine unlearning, addressing a critical safety/privacy gap where existing metrics can be gamed. It offers broadly applicable metrics (including an oracle-free option) across multiple modalities and domains (tabular, images, clinical text, face ID), increasing cross-field relevance and real-world applicability for compliance and trustworthy ML. The methodology is relatively rigorous (mixed-effects modeling, effect sizes, multi-method comparisons) and highly timely given regulatory and deployment pressure for provable data removal.
Paper 1 addresses the fundamental and increasingly urgent problem of maintaining human oversight over autonomous AI systems, combining multiple theoretical contributions (conformal decision theory guarantees, attainable utility preservation) into a practical framework with empirical validation. Its breadth of impact spans AI safety, alignment, and deployment policy. Paper 2 makes a solid contribution to machine unlearning verification, but addresses a narrower problem. Paper 1's timeliness given rapid AI agent deployment, its novel theoretical guarantees, and its relevance to the critical challenge of AI safety give it higher potential impact.
Paper 2 likely has higher scientific impact due to a broader, more general contribution: a controlled framework (ScaleLogic) to study RL scaling for long-horizon reasoning with independently tunable difficulty axes, plus strong empirical laws (power-law scaling with high R²) and actionable conclusions (expressiveness affects scaling exponent and transfer; curricula improve efficiency) that can guide future RL/LLM training. Its relevance to current LLM reasoning and compute-scaling questions and applicability across RL methods and downstream benchmarks suggest wider cross-field uptake than Paper 1’s important but more specialized verification metrics for machine unlearning.
Paper 1 exposes a fundamental flaw in how machine unlearning is currently evaluated and introduces a robust, representation-level verification framework. By setting a new, rigorous standard for unlearning verification, it has profound implications for AI privacy, security, and legal compliance (e.g., GDPR). While Paper 2 offers a valuable advancement in LLM knowledge editing, Paper 1's potential to redefine the evaluation paradigm of an entire subfield gives it higher broader scientific impact.
Paper 1 represents a significant milestone in AI-driven scientific discovery, successfully utilizing LLM agents to improve upon a gold-standard physics baseline by 9%. This advancement in density functional theory has massive downstream implications for computational chemistry, materials science, and drug discovery. While Paper 2 provides crucial insights for AI privacy and unlearning, Paper 1 demonstrates a broader, cross-disciplinary paradigm shift in how scientific laws and models can be autonomously formulated and optimized.
Paper 1 addresses the critical and timely problem of hallucination in multimodal large reasoning models, which is central to the rapidly growing field of LLMs/MLLMs. Its novel decomposition of CoT reasoning from answer optimization (RC-DPO) and the MCTS-based data generation strategy represent meaningful methodological contributions with broad applicability across the fast-expanding multimodal AI ecosystem. Paper 2 makes a solid contribution to machine unlearning verification, but targets a narrower community. The sheer scale of interest in reasoning model reliability and hallucination mitigation gives Paper 1 greater potential for citations and real-world impact.
Paper 1 addresses a critical and highly timely issue in AI privacy and safety: the failure of current machine unlearning methods to truly erase data at the representation level. Its findings have profound implications for regulatory compliance (e.g., GDPR) and foundation model training across multiple domains. Paper 2, while offering a valuable methodological tool for reproducibility using LLM agents, focuses on a narrower applied field (PHM) and presents an engineering solution rather than exposing a fundamental flaw in existing AI paradigms.
Paper 1 likely has higher impact due to strong novelty and timeliness in machine unlearning verification: it identifies a key failure mode of output-level evaluations and introduces representation-level, partially oracle-free metrics validated across multiple modalities and domains with statistical rigor. The work has immediate real-world relevance for privacy, compliance, and safety auditing of deployed ML systems, and could become a standard evaluation layer across unlearning methods. Paper 2 is valuable and rigorous in formal normative planning, but appears narrower in demonstrated scope and empirical breadth (single agent/task), potentially limiting near-term cross-field uptake.
Paper 1 is more likely to have broad, lasting scientific impact: it introduces novel representation-level verification metrics for machine unlearning, exposing failures of standard output-level protocols across multiple modalities and providing both oracle-comparative and oracle-free tests with statistical evidence. This advances methodological rigor and addresses a timely, high-stakes problem (privacy, compliance, data deletion) relevant across ML, security, and policy. Paper 2 is a well-motivated applied systems/design contribution in fintech, but its novelty is more architectural and domain-specific, with impact likely concentrated in finance tooling rather than generalizable scientific methodology.
Paper 2 is more novel and broadly impactful: it introduces representation-level verification for machine unlearning, addressing a timely, high-stakes gap in privacy/ML compliance where output-level tests can be misleading. The proposed oracle-comparative (M2) and oracle-free (M4) metrics generalize across modalities (tabular, images, clinical text, face ID) and reveal systematic residual memorization across multiple unlearning methods, suggesting field-wide implications and new evaluation standards. Paper 1 is a solid applied imaging contribution but is narrower in scope and closer to established plug-and-play/denoiser paradigms, with limited demonstrated rigor on real clinical data.
Paper 1 addresses a critical flaw in current machine unlearning evaluation by exposing that models retain forgotten data in intermediate representations despite passing output-level tests. Given the rising legal and ethical pressures around data privacy and copyright (e.g., GDPR), a rigorous, representation-level verification method has profound, immediate implications across the entire machine learning community. Paper 2 presents a novel and useful programming model for LLM agents, but Paper 1's fundamental challenge to the efficacy of existing unlearning techniques promises broader and more disruptive scientific impact.
Paper 1 likely has higher scientific impact due to its novel shift from output-level to representation-level verification for machine unlearning, addressing a critical gap with immediate implications for privacy, compliance, and safety auditing. It proposes both oracle-comparative and oracle-free metrics, demonstrates failures of multiple existing unlearning methods across diverse modalities, and uses statistically grounded analysis (mixed-effects models, effect sizes). This evaluation framework can influence many unlearning algorithms and regulatory practices. Paper 2 is timely and useful for multi-LLM coordination, but its impact is more benchmark- and paradigm-specific and may be outpaced quickly by fast-moving RL/agent methods.
Paper 1 offers profound interdisciplinary scientific impact by directly linking frontier Large Reasoning Models with human cognitive and neural mechanisms using fMRI data. While Paper 2 provides critical engineering and safety evaluations for machine unlearning, Paper 1 advances fundamental scientific understanding of both artificial and biological intelligence, offering a novel computational account of human learning in complex environments.
Paper 1 exposes a fundamental flaw in current machine unlearning evaluations, demonstrating that models retaining knowledge at the representation level can still pass standard output-level tests. Its introduction of representation-level metrics (RULER) has profound implications for AI safety, privacy, and copyright compliance. Paper 2 presents a strong, domain-specific methodological improvement for fraud detection combining LLMs and GNNs, but Paper 1's findings challenge an entire subfield's evaluation paradigm, giving it significantly broader and more disruptive potential scientific impact.
Paper 2 has higher impact potential because it addresses a foundational and timely problem—verifying machine unlearning—where existing evaluation can be misleading. Its key novelty is shifting verification from outputs to representations, revealing failures that current protocols miss, and it provides both oracle-comparative and oracle-free metrics applicable across diverse modalities (tabular, vision, clinical text, face ID). This broadens relevance across privacy, security, and regulation-driven ML deployment. While Paper 1 is strong and rigorous, its scope is narrower (LLM hallucination detection) and primarily improves efficiency within an already crowded detection landscape.
RULER addresses a fundamental gap in machine unlearning verification by revealing that models can pass existing output-level evaluations while still retaining forgotten data in intermediate representations. This has significant implications for privacy regulations (GDPR, right to be forgotten), AI safety, and trustworthy ML. The work introduces principled metrics (M2, M4) with rigorous statistical analysis across diverse domains (tabular, image, clinical text, face recognition). Its broader impact spans privacy, regulatory compliance, and model auditing. AsyncTool, while valuable, addresses a more niche evaluation gap in LLM tool-calling efficiency with narrower implications.
Paper 1 likely has higher scientific impact due to introducing a broadly applicable, timely verification layer for machine unlearning—an area central to privacy, compliance, and trustworthy ML. The representation-level metrics (including an oracle-free diagnostic) address a clear loophole in prevailing evaluations and are demonstrated across multiple modalities and domains, suggesting wide cross-field adoption potential. Paper 2 is strong and practical but is more domain-specific (transit planning) and methodologically more incremental (MCTS+policy/value, AlphaZero-style), with impact likely concentrated in operations research/transportation.