RULER: Representation-Level Verification of Machine Unlearning

Georgina Cosma, Axel Finke

May 26, 2026

arXiv:2605.27569v1

cs.AI (primary)

#190 of 2682 · Artificial Intelligence

Tournament Score: 1526±48 (scale 1050 to 1800)

Win Rate: 79%

Rating: 6.5/10

Significance 7 · Rigor 7.5 · Novelty 6.5 · Clarity 7.5

Abstract

Machine unlearning aims to remove the influence of specific training records from a deployed model without retraining from scratch. Current protocols verify this at the output level through membership inference, retain accuracy, and forget-set accuracy, but a model can satisfy all three whilst still encoding forgotten records in its intermediate representations. We introduce RULER, a set of representation-level verification metrics. The oracle-comparative metric M2 measures whether forget-set records occupy the same representational position as in a model retrained without them. The oracle-free metric M4 detects residuals from the unlearned model's internal similarity structure alone, without retraining. Four approximate unlearning methods all pass output-level evaluation, yet under a linear mixed-effects model M2 detects significant residuals in 10 of 12 conditions (p<0.05), with effect sizes growing as the forget fraction increases. A fifth method, Bad Teacher, shows the same residuals despite a different forgetting mechanism. M4 acts as a pre-unlearning diagnostic across tabular, image, clinical text, and face-identity settings: it detects identity-level memorisation in face recognition models where no tested method fully erases the signal.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: RULER — Representation-Level Verification of Machine Unlearning

1. Core Contribution

RULER identifies and addresses a concrete gap in machine unlearning evaluation: current protocols (MIA accuracy, retain/forget accuracy) operate only at the output level, but models can pass all three while still encoding forgotten records in intermediate representations. The paper introduces two complementary metrics:

M₂ (oracle-comparative): A signed calibration gap measuring whether forget-set records occupy the same representational position as in a retrained-from-scratch oracle, using retain-set similarity as a calibrated baseline.

M₄ (oracle-free): A percentile-rank metric detecting residual memorization from the unlearned model's internal geometry alone, requiring no retraining.
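The two metric descriptions above can be made concrete with a small sketch. This assessment does not reproduce the paper's exact formulas, so everything below is an illustrative reading rather than the authors' definitions: the function names, the use of medians, the sign convention, and the toy vectors are all assumptions. The first helper compares each record's representation in the unlearned model with its representation in the oracle and reports how far the forget set falls from the retain-set baseline; the second ranks per-record scores from a single model against the retain-set distribution, with no oracle involved.

```python
import math
from statistics import mean, median


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def signed_gap(forget_sims, retain_sims):
    """Median forget similarity minus median retain similarity.

    ASSUMPTION: the medians and the sign convention are illustrative only.
    """
    return median(forget_sims) - median(retain_sims)


def percentile_rank(value, reference):
    """Fraction of reference values strictly below `value` (between 0 and 1)."""
    return sum(r < value for r in reference) / len(reference)


def mean_percentile_rank(forget_scores, retain_scores):
    """Average percentile rank of the forget scores within the retain scores."""
    return mean(percentile_rank(s, retain_scores) for s in forget_scores)


# Toy penultimate-layer vectors: (unlearned model, oracle model) per record.
retain_pairs = [([1.0, 0.0], [1.0, 0.1]),
                ([0.0, 1.0], [0.1, 1.0]),
                ([1.0, 1.0], [1.0, 0.9])]
forget_pairs = [([1.0, 0.0], [0.6, 0.8]),
                ([0.0, 1.0], [0.8, 0.6])]

retain_sims = [cosine(u, o) for u, o in retain_pairs]
forget_sims = [cosine(u, o) for u, o in forget_pairs]
print("oracle-comparative gap:", signed_gap(forget_sims, retain_sims))

# Toy per-record scores from one model only (a hypothetical statistic).
print("oracle-free rank:", mean_percentile_rank([0.9, 0.8], [0.2, 0.4, 0.6, 0.85]))
```

In this toy setup a negative gap means the forget records sit further from their oracle positions than retain records do, and a rank near 0.5 would mean the forget scores are indistinguishable from the retain scores.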

The key empirical finding is a discordance: four approximate unlearning methods (Gradient Ascent, NegGrad+, Fine-Tuning, SCRUB) all pass output-level evaluation yet show statistically significant representation-level residuals under M₂ in 10 of 12 conditions, and a fifth method, Bad Teacher, shows the same residuals despite a different forgetting mechanism. This is a meaningful contribution because it challenges the assumption that passing standard evaluation protocols implies successful unlearning.

2. Methodological Rigor

The experimental design is commendably thorough:

Statistical framework: The use of linear mixed-effects models with random intercepts for dataset, paired with Wilcoxon signed-rank tests and rank-biserial effect sizes, is appropriate for the hierarchical data structure (10 datasets × 10 seeds). The authors acknowledge that 10 clusters may overstate LMM significance and provide the more conservative Wilcoxon test alongside.

Paired-seed design: Sharing initialization between the original and oracle models isolates unlearning effects from geometric variation across random initializations — a critical design choice that is well-justified.

Robustness checks: The paper includes extensive sensitivity analyses across forget-set sampling (5 additional seeds), learning rates, mini-batch training, and retain-baseline choice (median vs. mean). The median baseline choice is convincingly motivated: the mean reverses the sign in 5/12 conditions due to right-skewness.
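The sign reversal attributed to right-skewness above is easy to reproduce on toy numbers. The sketch below uses made-up similarity values (not data from the paper) in which one large retain-set value pulls the mean well above the median, so the same forget-set value lands above one baseline and below the other.

```python
from statistics import mean, median

# Hypothetical retain-set similarities with a long right tail.
retain_sims = [0.10, 0.11, 0.12, 0.13, 0.60]

# Hypothetical forget-set similarity that falls between median and mean.
forget_sim = 0.15

gap_vs_median = forget_sim - median(retain_sims)  # baseline = 0.12
gap_vs_mean = forget_sim - mean(retain_sims)      # baseline is about 0.212

print("gap vs median baseline:", gap_vs_median)
print("gap vs mean baseline:  ", gap_vs_mean)
print("sign reverses:", (gap_vs_median > 0) != (gap_vs_mean > 0))
```

The median is insensitive to the single outlying value, which is the usual reason to prefer it as a baseline for skewed distributions.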

However, several limitations merit attention. The primary experiments use a simple two-layer MLP on tabular data, which limits generalizability claims. The diagnostic experiments on images, text, and faces use only 5 seeds, which limits their statistical power. The paired-seed requirement for M₂ is a practical constraint: the metric is unavailable when the original model's initialization seed is unknown. The paper also uses cosine similarity at a single layer (the penultimate one), and it is unclear whether residuals at other layers would tell a different story.

The M₂ effect sizes, while statistically significant, are extremely small in absolute terms (gaps of ~0.001–0.005 in cosine similarity). The practical significance of these residuals — whether they could be exploited by an adversary or constitute meaningful information leakage — remains unaddressed.
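The rank-biserial effect size that the statistical framework pairs with the Wilcoxon signed-rank test can be computed from the signed-rank sums alone. The following is a minimal stdlib sketch of the matched-pairs rank-biserial correlation, (W+ − W−) / (W+ + W−); the function names and the toy inputs are hypothetical, and the paper's own implementation may differ. It does not compute a p-value; a library routine such as `scipy.stats.wilcoxon` would normally be used for that.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_biserial(x, y):
    """Return (W+, W-, r) for paired samples x and y.

    Zero differences are dropped, as in the standard Wilcoxon procedure.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, (w_plus - w_minus) / (w_plus + w_minus)


# Toy paired scores (integers, to keep tie handling exact).
unlearned = [5, 1, 4]
oracle = [2, 3, 3]
print(rank_biserial(unlearned, oracle))
```

Values of r near +1 or −1 indicate that almost all paired differences point the same way, regardless of how small those differences are in absolute terms, which is why a strong rank-based effect can coexist with the small cosine-similarity gaps noted above.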

3. Potential Impact

For unlearning evaluation: RULER provides a concrete, implementable toolkit that could become part of standard evaluation protocols. The oracle-free M₄ is particularly practical as a pre-unlearning diagnostic, since it requires no retraining. The finding that identity-level memorization in face recognition (LFW, M₄ = 0.73–0.94) persists after unlearning is directly relevant to GDPR compliance scenarios.

For method development: The consistent finding that all tested methods leave representation-level residuals suggests that future unlearning objectives should incorporate representation-level constraints. This could redirect algorithm design toward explicitly penalizing intermediate-layer memorization.

For regulation: While the authors appropriately disclaim that RULER is not a legal compliance test, the metrics could inform regulatory frameworks by providing a more stringent standard than output-level evaluation alone.

Limitations on impact: The paper does not demonstrate that these residuals lead to practical privacy violations (e.g., reconstruction attacks exploiting intermediate representations). Without this connection, the practical urgency of the finding is somewhat uncertain.

4. Timeliness & Relevance

Machine unlearning is increasingly relevant due to GDPR's right to erasure and growing deployment of ML systems. Recent work by Hayes et al. and Goel et al. has shown that standard unlearning evaluations can be misleading, making this a timely contribution. The shift from output-level to representation-level verification fills a recognized gap in the literature. The extension to face-identity and clinical text settings directly addresses high-stakes deployment scenarios.

5. Strengths & Limitations

Key Strengths:

Clear identification of an important gap (output vs. representation-level evaluation)

Well-designed statistical framework with appropriate mixed-effects modeling

Extensive robustness checks across hyperparameters, forget-set sampling, and training regimes

M₄'s oracle-free nature makes it immediately deployable

Cross-domain evaluation (tabular, image, clinical text, face identity)

Code availability enhances reproducibility

Notable Limitations:

Small absolute effect sizes raise questions about practical exploitability

Primary experiments limited to shallow MLPs; deeper architectures appear only in diagnostics with fewer seeds

The paired-seed requirement for M₂ is restrictive in practice

No adversarial attack exploiting the detected residuals is demonstrated

M₄'s high ICC (0.89) means it is primarily a dataset property rather than a method-discriminating metric at the population level

Mini-batch training, the standard regime for large-scale models, invalidates M₂'s paired-seed calibration (Table 6)

Limited to penultimate-layer analysis; no multi-layer investigation
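The ICC point in the list above can be illustrated with a one-way ANOVA estimator of the intraclass correlation. The assessment does not say how the paper computed its ICC (it may come from the mixed-effects model's variance components), so this estimator, the function name, and the toy values are assumptions chosen for illustration. Each inner list stands for one dataset's metric values across seeds.

```python
from statistics import mean


def icc_oneway(groups):
    """One-way random-effects ICC(1) for balanced groups.

    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), with k values per group.
    """
    n = len(groups)
    k = len(groups[0])
    if n < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need at least 2 balanced groups of size >= 2")
    grand = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]
    ms_between = k * sum((m - grand) ** 2 for m in group_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means)
                    for x in g) / (n * (k - 1))
    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("no variance in the data")
    return (ms_between - ms_within) / denom


# Hypothetical metric values: 3 datasets x 3 seeds.
# Values differ far more between datasets than between seeds.
by_dataset = [[0.70, 0.72, 0.71],
              [0.50, 0.52, 0.51],
              [0.90, 0.88, 0.89]]
print("ICC:", icc_oneway(by_dataset))
```

When most of the variance sits between datasets, as in this toy example, the ICC approaches 1 and differences between unlearning methods within a dataset contribute little, which is the sense in which a high-ICC metric reads as a dataset property.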

Additional observations: The finding that M₂ reverses sign under mini-batch training (Table 6) is concerning, as it suggests the paired-seed calibration is fragile. This substantially limits M₂'s applicability to realistic large-scale settings. M₄ is more robust in this regard but lacks the statistical power for population-level inference.

Overall, RULER makes a solid conceptual and methodological contribution to machine unlearning evaluation. The identification of the output-representation discordance is valuable, and the metrics are well-formulated. However, the practical significance of the detected residuals and the scalability limitations of M₂ temper the immediate impact.

Rating: 6.5/10

Significance 7 · Rigor 7.5 · Novelty 6.5 · Clarity 7.5

Generated May 28, 2026

Comparison History (24)

vs. DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

claude-opus-4.6 · 5/28/2026

RULER addresses a fundamental gap in machine unlearning verification by revealing that models can pass all existing output-level tests while retaining forgotten data in intermediate representations. This has broad implications for privacy, regulatory compliance (GDPR right-to-be-forgotten), and trustworthy AI. The methodology is rigorous (linear mixed-effects models, multiple domains), introduces both oracle-dependent and oracle-free metrics, and exposes a critical blind spot in current evaluation protocols. DenoiseRL offers incremental improvements to RL-based reasoning training, but RULER opens an entirely new evaluation dimension with immediate practical and regulatory relevance across multiple fields.

vs. GS-FUSE: Granger-Supervised Gated Fusion and Multi-Granularity Alignment for Event-Driven Financial Forecasting

gemini-3.1 · 5/28/2026

Paper 1 addresses a foundational issue in machine unlearning by exposing the inadequacy of output-level verification and proposing representation-level metrics. Given the increasing regulatory pressure for data privacy and copyright compliance in AI, robust machine unlearning is a critical, highly timely challenge. Its implications span security, privacy, and core ML. Paper 2, while methodologically sound and useful for financial forecasting, presents a domain-specific application of multimodal fusion, which is likely to have a narrower scientific impact compared to the foundational privacy and evaluation mechanisms proposed in Paper 1.

vs. DeepSciVerify: Verifying Scientific Claim--Citation Alignment via LLM-Driven Evidence Escalation

gpt-5.2 · 5/28/2026

Paper 2 has higher impact potential due to greater novelty (shifting unlearning verification from outputs to internal representations) and broad applicability across ML domains where privacy, compliance, and safety are critical. Its oracle-free metric (M4) is especially practical, enabling diagnostics without costly retraining, and the finding that many methods “pass” standard tests yet retain residuals could reshape evaluation standards. Methodological rigor is stronger (statistical testing across multiple modalities). Paper 1 is timely and useful for LLM reliability, but is a more incremental systems improvement on an established task.

vs. SkillGrad: Optimizing Agent Skills Like Gradient Descent

gpt-5.2 · 5/28/2026

Paper 2 (RULER) has higher impact potential due to its novel shift from output-level to representation-level verification for machine unlearning, addressing a critical safety/privacy gap where existing metrics can be gamed. It offers broadly applicable metrics (including an oracle-free option) across multiple modalities and domains (tabular, images, clinical text, face ID), increasing cross-field relevance and real-world applicability for compliance and trustworthy ML. The methodology is relatively rigorous (mixed-effects modeling, effect sizes, multi-method comparisons) and highly timely given regulatory and deployment pressure for provable data removal.

vs. Calibrating Conservatism for Scalable Oversight

claude-opus-4.6 · 5/28/2026

Paper 1 addresses the fundamental and increasingly urgent problem of maintaining human oversight over autonomous AI systems, combining multiple theoretical contributions (conformal decision theory guarantees, attainable utility preservation) into a practical framework with empirical validation. Its breadth of impact spans AI safety, alignment, and deployment policy. Paper 2 makes a solid contribution to machine unlearning verification, but addresses a narrower problem. Paper 1's timeliness given rapid AI agent deployment, its novel theoretical guarantees, and its relevance to the critical challenge of AI safety give it higher potential impact.

vs. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact due to a broader, more general contribution: a controlled framework (ScaleLogic) to study RL scaling for long-horizon reasoning with independently tunable difficulty axes, plus strong empirical laws (power-law scaling with high R²) and actionable conclusions (expressiveness affects scaling exponent and transfer; curricula improve efficiency) that can guide future RL/LLM training. Its relevance to current LLM reasoning and compute-scaling questions and applicability across RL methods and downstream benchmarks suggest wider cross-field uptake than Paper 1’s important but more specialized verification metrics for machine unlearning.

vs. From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

gemini-3.1 · 5/28/2026

Paper 1 exposes a fundamental flaw in how machine unlearning is currently evaluated and introduces a robust, representation-level verification framework. By setting a new, rigorous standard for unlearning verification, it has profound implications for AI privacy, security, and legal compliance (e.g., GDPR). While Paper 2 offers a valuable advancement in LLM knowledge editing, Paper 1's potential to redefine the evaluation paradigm of an entire subfield gives it higher broader scientific impact.

vs. Agentic Discovery of Exchange-Correlation Density Functionals

gemini-3.1 · 5/28/2026

Paper 1 represents a significant milestone in AI-driven scientific discovery, successfully utilizing LLM agents to improve upon a gold-standard physics baseline by 9%. This advancement in density functional theory has massive downstream implications for computational chemistry, materials science, and drug discovery. While Paper 2 provides crucial insights for AI privacy and unlearning, Paper 1 demonstrates a broader, cross-disciplinary paradigm shift in how scientific laws and models can be autonomously formulated and optimized.

vs. Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

claude-opus-4.6 · 5/28/2026

Paper 1 addresses the critical and timely problem of hallucination in multimodal large reasoning models, which is central to the rapidly growing field of LLMs/MLLMs. Its novel decomposition of CoT reasoning from answer optimization (RC-DPO) and the MCTS-based data generation strategy represent meaningful methodological contributions with broad applicability across the fast-expanding multimodal AI ecosystem. Paper 2 makes a solid contribution to machine unlearning verification, but targets a narrower community. The sheer scale of interest in reasoning model reliability and hallucination mitigation gives Paper 1 greater potential for citations and real-world impact.

vs. From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical and highly timely issue in AI privacy and safety: the failure of current machine unlearning methods to truly erase data at the representation level. Its findings have profound implications for regulatory compliance (e.g., GDPR) and foundation model training across multiple domains. Paper 2, while offering a valuable methodological tool for reproducibility using LLM agents, focuses on a narrower applied field (PHM) and presents an engineering solution rather than exposing a fundamental flaw in existing AI paradigms.

vs. Reasoning and Planning with Dynamically Changing Norms

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact due to strong novelty and timeliness in machine unlearning verification: it identifies a key failure mode of output-level evaluations and introduces representation-level, partially oracle-free metrics validated across multiple modalities and domains with statistical rigor. The work has immediate real-world relevance for privacy, compliance, and safety auditing of deployed ML systems, and could become a standard evaluation layer across unlearning methods. Paper 2 is valuable and rigorous in formal normative planning, but appears narrower in demonstrated scope and empirical breadth (single agent/task), potentially limiting near-term cross-field uptake.

vs. FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

gpt-5.2 · 5/28/2026

Paper 1 is more likely to have broad, lasting scientific impact: it introduces novel representation-level verification metrics for machine unlearning, exposing failures of standard output-level protocols across multiple modalities and providing both oracle-comparative and oracle-free tests with statistical evidence. This advances methodological rigor and addresses a timely, high-stakes problem (privacy, compliance, data deletion) relevant across ML, security, and policy. Paper 2 is a well-motivated applied systems/design contribution in fintech, but its novelty is more architectural and domain-specific, with impact likely concentrated in finance tooling rather than generalizable scientific methodology.

vs. Gradient Step Plug-and-Play Model for Dental Cone-Beam CT Reconstruction

gpt-5.2 · 5/28/2026

Paper 2 is more novel and broadly impactful: it introduces representation-level verification for machine unlearning, addressing a timely, high-stakes gap in privacy/ML compliance where output-level tests can be misleading. The proposed oracle-comparative (M2) and oracle-free (M4) metrics generalize across modalities (tabular, images, clinical text, face ID) and reveal systematic residual memorization across multiple unlearning methods, suggesting field-wide implications and new evaluation standards. Paper 1 is a solid applied imaging contribution but is narrower in scope and closer to established plug-and-play/denoiser paradigms, with limited demonstrated rigor on real clinical data.

vs. LACUNA: Safe Agents as Recursive Program Holes

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical flaw in current machine unlearning evaluation by exposing that models retain forgotten data in intermediate representations despite passing output-level tests. Given the rising legal and ethical pressures around data privacy and copyright (e.g., GDPR), a rigorous, representation-level verification method has profound, immediate implications across the entire machine learning community. Paper 2 presents a novel and useful programming model for LLM agents, but Paper 1's fundamental challenge to the efficacy of existing unlearning techniques promises broader and more disruptive scientific impact.

vs. TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

gpt-5.2 · 5/28/2026

Paper 1 likely has higher scientific impact due to its novel shift from output-level to representation-level verification for machine unlearning, addressing a critical gap with immediate implications for privacy, compliance, and safety auditing. It proposes both oracle-comparative and oracle-free metrics, demonstrates failures of multiple existing unlearning methods across diverse modalities, and uses statistically grounded analysis (mixed-effects models, effect sizes). This evaluation framework can influence many unlearning algorithms and regulatory practices. Paper 2 is timely and useful for multi-LLM coordination, but its impact is more benchmark- and paradigm-specific and may be outpaced quickly by fast-moving RL/agent methods.

vs. Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

gemini-3.1 · 5/28/2026

Paper 1 offers profound interdisciplinary scientific impact by directly linking frontier Large Reasoning Models with human cognitive and neural mechanisms using fMRI data. While Paper 2 provides critical engineering and safety evaluations for machine unlearning, Paper 1 advances fundamental scientific understanding of both artificial and biological intelligence, offering a novel computational account of human learning in complex environments.

vs. Let Relations Speak: An End-to-End LLM-GNN Soft Prompt Framework for Fraud Detection

gemini-3.1 · 5/28/2026

Paper 1 exposes a fundamental flaw in current machine unlearning evaluations, demonstrating that models retaining knowledge at the representation level can still pass standard output-level tests. Its introduction of representation-level metrics (RULER) has profound implications for AI safety, privacy, and copyright compliance. Paper 2 presents a strong, domain-specific methodological improvement for fraud detection combining LLMs and GNNs, but Paper 1's findings challenge an entire subfield's evaluation paradigm, giving it significantly broader and more disruptive potential scientific impact.

vs. Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

gpt-5.2 · 5/28/2026

Paper 2 has higher impact potential because it addresses a foundational and timely problem—verifying machine unlearning—where existing evaluation can be misleading. Its key novelty is shifting verification from outputs to representations, revealing failures that current protocols miss, and it provides both oracle-comparative and oracle-free metrics applicable across diverse modalities (tabular, vision, clinical text, face ID). This broadens relevance across privacy, security, and regulation-driven ML deployment. While Paper 1 is strong and rigorous, its scope is narrower (LLM hallucination detection) and primarily improves efficiency within an already crowded detection landscape.

vs. AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

claude-opus-4.6 · 5/28/2026

RULER addresses a fundamental gap in machine unlearning verification by revealing that models can pass existing output-level evaluations while still retaining forgotten data in intermediate representations. This has significant implications for privacy regulations (GDPR, right to be forgotten), AI safety, and trustworthy ML. The work introduces principled metrics (M2, M4) with rigorous statistical analysis across diverse domains (tabular, image, clinical text, face recognition). Its broader impact spans privacy, regulation compliance, and model auditing. AsyncTool, while valuable, addresses a more niche evaluation gap in LLM tool-calling efficiency with narrower implications.

vs. AlphaTransit: Learning to Design City-scale Transit Routes

gpt-5.2 · 5/28/2026

Paper 1 likely has higher scientific impact due to introducing a broadly applicable, timely verification layer for machine unlearning—an area central to privacy, compliance, and trustworthy ML. The representation-level metrics (including an oracle-free diagnostic) address a clear loophole in prevailing evaluations and are demonstrated across multiple modalities and domains, suggesting wide cross-field adoption potential. Paper 2 is strong and practical but is more domain-specific (transit planning) and methodologically more incremental (MCTS+policy/value, AlphaZero-style), with impact likely concentrated in operations research/transportation.