Interference-Aware Multi-Task Unlearning

Ying-Hua Huang, Rui Fang, Hsi-Wen Chen, Ming-Syan Chen

May 18, 2026

arXiv:2605.19042v1

cs.AI (primary)

#1297 of 2292 · Artificial Intelligence

Tournament Score: 1398 ± 41

Win Rate: 50%

Rating: 6.5 / 10

Significance 7 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

Machine unlearning aims to remove the contribution of designated training data from a trained model while preserving performance on the remaining data. Existing work mainly focuses on single-task settings, whereas modern models often operate in multi-task setups with shared backbones, where removing supervision for one task or instance can unintentionally affect others. We introduce multi-task unlearning with two settings: full-task unlearning, which removes a target instance from all tasks, and partial-task unlearning, which removes supervision only from selected tasks. We show that shared parameters couple the forget and retain sets, causing task-level interference on non-target tasks and instance-level interference on other instances. To address this issue, we propose an interference-aware framework that combines task-aware gradient projection, which constrains updates within task-specific subspaces, with instance-level gradient orthogonalization, which reduces conflicts between forget and retain signals. Experiments on two multi-task computer vision benchmarks across five tasks show that our method achieves effective unlearning while maintaining strong generalization, reducing UIS compared with the strongest baseline by 30.3% in full-task unlearning and 52.9% in partial-task unlearning.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Interference-Aware Multi-Task Unlearning

1. Core Contribution

This paper introduces the multi-task unlearning problem, distinguishing two settings: full-task unlearning (removing an instance from all tasks) and partial-task unlearning (removing supervision from selected tasks while retaining others). The key insight is that shared parameters in multi-task models create coupling between forget and retain sets, inducing task-level interference (degradation on non-target tasks) and instance-level interference (degradation on other instances). The proposed framework combines task-aware gradient projection (constraining updates to task-specific subspaces) with instance-level gradient orthogonalization (removing conflicting components between forget and retain gradients).
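The instance-level orthogonalization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes flattened gradient vectors, treats a negative inner product as the definition of "conflict", and uses a PCGrad-style projection. The function name `orthogonalize` and the `eps` parameter are hypothetical.

```python
import numpy as np

def orthogonalize(forget_grad, retain_grad, eps=1e-12):
    """Drop the part of forget_grad that opposes retain_grad.

    Assumption: the two gradients "conflict" when their inner product is
    negative. In that case the component of forget_grad along retain_grad
    is subtracted; otherwise forget_grad is returned unchanged.
    """
    dot = float(np.dot(forget_grad, retain_grad))
    if dot >= 0.0:
        return forget_grad.copy()
    # Scalar projection coefficient of forget_grad onto retain_grad.
    scale = dot / (float(np.dot(retain_grad, retain_grad)) + eps)
    return forget_grad - scale * retain_grad

forget = np.array([1.0, -1.0])
retain = np.array([0.0, 1.0])
cleaned = orthogonalize(forget, retain)
```

After the call, `cleaned` has (numerically) zero inner product with `retain`, so a step along it no longer pushes against the retain signal to first order.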

The problem formulation itself is the most significant contribution. While machine unlearning has been extensively studied in single-task settings, real-world deployments increasingly rely on multi-task architectures where a single input carries supervision for multiple tasks. The partial-task unlearning scenario—where an image should be forgotten for person identification but retained for action recognition—is practically compelling and previously unaddressed.

2. Methodological Rigor

Theoretical Foundation: The paper provides a formal analysis grounded in influence-function-style reasoning. Theorem 1 characterizes the first-order loss change induced by data removal through Hessian-preconditioned gradient coupling, establishing that both task- and instance-level interference share the same mathematical structure (Corollary 1). Proposition 1 demonstrates that naively following the unlearning gradient is suboptimal unless it aligns with an eigenvector of the retain Hessian—a condition that is generically not satisfied. These results motivate the method but rely on standard assumptions (local quadratic approximation, invertible Hessian, small forget-to-retain ratio).
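For orientation, the textbook influence-function approximation that this kind of analysis builds on can be written as below. This is the standard form, not a restatement of the paper's Theorem 1, whose exact normalization and choice of Hessian (the review mentions the retain Hessian) may differ. The symbols are assumptions for illustration: n training points, forget set F, per-example loss ℓ, and fitted parameters θ̂.

```latex
\begin{align}
\hat{\theta}_{-F} - \hat{\theta}
  &\approx \frac{1}{n}\, H^{-1} \sum_{z \in F} \nabla_\theta \ell(z; \hat{\theta}),
  \qquad
  H = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2} \ell(z_i; \hat{\theta}), \\
\ell(z'; \hat{\theta}_{-F}) - \ell(z'; \hat{\theta})
  &\approx \frac{1}{n}\, \nabla_\theta \ell(z'; \hat{\theta})^{\top} H^{-1}
     \sum_{z \in F} \nabla_\theta \ell(z; \hat{\theta}).
\end{align}
```

The second line is the Hessian-preconditioned coupling between the gradient at a retained point z' and the gradients of the forgotten points. Whether z' belongs to another task or is another instance, the expression has the same form, which is consistent with the shared structure attributed to Corollary 1.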

Method Design: The low-rank update formulation (Eq. 6) is parameter-efficient and practical. Task-aware gradient projection via orthonormal task-specific bases with mutual orthogonality regularization is well-motivated by continual learning literature. The sequential orthogonalization scheme (clean → instance → task) is a reasonable heuristic, though the paper does not formally justify this specific ordering beyond intuitive reasoning.
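The projection step and the mutual orthogonality regularizer mentioned above can be sketched as follows. This is a hedged illustration under stated assumptions: each task subspace is represented by a matrix with orthonormal columns obtained via QR, and the regularizer is taken to be the squared Frobenius norm of cross-task basis products. The function names, shapes, and this particular penalty are hypothetical and not taken from the paper.

```python
import numpy as np

def orthonormal_basis(matrix):
    """Return a matrix with orthonormal columns spanning the input's columns."""
    q, _ = np.linalg.qr(matrix)  # reduced QR: q has the same shape as matrix
    return q

def project_to_subspace(grad, basis):
    """Project grad onto the span of the (orthonormal) basis columns."""
    return basis @ (basis.T @ grad)

def mutual_orthogonality_penalty(bases):
    """Sum of squared Frobenius norms of B_i^T B_j over all task pairs i < j."""
    total = 0.0
    for i in range(len(bases)):
        for j in range(i + 1, len(bases)):
            total += float(np.sum((bases[i].T @ bases[j]) ** 2))
    return total

rng = np.random.default_rng(0)
task_basis = orthonormal_basis(rng.normal(size=(6, 2)))
grad = rng.normal(size=6)
projected = project_to_subspace(grad, task_basis)
```

The penalty is zero exactly when every pair of task subspaces is mutually orthogonal, which is the condition under which an update projected into one task's subspace has no component in another's.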

Experimental Setup: The evaluation is reasonably comprehensive: two benchmarks (NYUv2 and PASCAL), five tasks, two backbone architectures (ViT-L and Swin-L), six baselines, and the UIS metric that captures multi-dimensional tradeoffs. The 10-run averaging with reported standard deviations ≤3% adds reliability. The scalability analysis (Figure 2) from 10% to 50% unlearn ratios is valuable.

Concerns: The MIA evaluation uses a simple loss-based threshold rather than a trained attack classifier, which may underestimate privacy leakage. The early stopping criterion based on MIA score closest to retrained reference could introduce bias favoring the method. The UIS metric, while reasonable, involves somewhat arbitrary aggregation of normalized deviations that could mask important individual failures.
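A loss-based threshold attack of the kind referred to above can be sketched in a few lines. This is a generic illustration, assuming that samples with loss below a fixed threshold are predicted to be training members; the paper's exact attack, threshold selection, and MIA score are not specified in this review, and the names below are hypothetical.

```python
import numpy as np

def loss_threshold_mia(losses, threshold):
    """Predict membership (True) for samples whose loss is below the threshold."""
    return np.asarray(losses, dtype=float) < threshold

def attack_accuracy(member_losses, nonmember_losses, threshold):
    """Fraction of samples whose membership the threshold rule gets right."""
    pred_members = loss_threshold_mia(member_losses, threshold)
    pred_nonmembers = loss_threshold_mia(nonmember_losses, threshold)
    correct = int(pred_members.sum()) + int((~pred_nonmembers).sum())
    return correct / (len(member_losses) + len(nonmember_losses))

# Toy losses, made up purely for illustration.
acc = attack_accuracy([0.1, 0.2, 0.9], [0.8, 1.2, 1.5], threshold=0.5)
```

Because the rule looks only at a single scalar per sample, it cannot pick up membership signal that a trained attack classifier might extract from richer features, which is the basis of the concern about underestimated leakage.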

3. Potential Impact

Practical Applications: The partial-task unlearning setting addresses a genuine real-world need. Privacy regulations like GDPR require selective data removal, but in multi-task systems, the same data may have different privacy requirements across tasks. This framework enables fine-grained control that single-task methods cannot provide.

Research Direction: This paper opens a new sub-problem within machine unlearning. Future work could extend this to: (1) NLP and multimodal models with shared representations, (2) federated multi-task learning, (3) more complex task relationships (hierarchical, sequential), and (4) tighter theoretical bounds on interference.

Limitations on Impact: The evaluation is restricted to computer vision with relatively simple multi-task setups (3 tasks on NYUv2, 2 on PASCAL). Modern foundation models may operate over dozens of tasks with complex adapter compositions, and it's unclear how well the approach scales. The reliance on LoRA-based updates, while efficient, may not capture all forms of memorization in the shared backbone.

4. Timeliness & Relevance

The paper is well-timed. Multi-task and multi-modal models are becoming the norm (foundation models, instruction-tuned LLMs, unified vision models), and privacy regulations are simultaneously tightening. The gap between single-task unlearning research and multi-task deployment is real and growing. Parameter-efficient fine-tuning (LoRA) is now standard practice, making the low-rank unlearning formulation directly applicable to current workflows.

However, the paper does not address the most pressing frontier—LLM unlearning with instruction tuning across many tasks—which limits its immediate relevance to the most active research community.

5. Strengths & Limitations

Key Strengths:

Novel and well-motivated problem formulation with clear practical relevance (partial-task unlearning)

Strong empirical results: 30.3% and 52.9% UIS reductions over best baselines in full-task and partial-task settings

Thorough ablation study demonstrating the contribution of each component, with "w/o Task" showing dramatic degradation (164.4% UIS vs 22.0%)

Scalability analysis showing stability up to 50% unlearn ratio while baselines collapse

Generalization across architectures (ViT-L and Swin-L)

Complete theoretical motivation connecting interference to Hessian-preconditioned gradient coupling

Notable Limitations:

Vision-only evaluation: The authors acknowledge this but it limits generalizability claims

Small task count: 2-3 tasks per benchmark; real multi-task systems may have many more

Sequential orthogonalization ordering is heuristically motivated without formal justification or comparison of alternative orderings

No comparison with task-specific adapter removal, which the authors dismiss in a footnote but could serve as a practical baseline

The UIS metric aggregates heterogeneous deviations and may obscure failure modes in individual metrics

Computational overhead of the sequential orthogonalization and task-specific projections is not analyzed

Fixed LoRA rank and subspace dimension: sensitivity to these hyperparameters is not explored
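The limitation noted above, that an aggregate of normalized deviations can hide a failure in one metric, can be illustrated with a toy example. The aggregate below is a hypothetical stand-in (mean absolute relative deviation from a reference), not the paper's UIS definition, and all numbers are made up.

```python
import numpy as np

def aggregate_deviation(method_scores, reference_scores):
    """Mean absolute relative deviation from the reference scores."""
    method = np.asarray(method_scores, dtype=float)
    ref = np.asarray(reference_scores, dtype=float)
    return float(np.mean(np.abs(method - ref) / np.abs(ref)))

reference = [100.0, 100.0, 100.0, 100.0]
uniform = [90.0, 90.0, 90.0, 90.0]     # every metric is off by 10%
skewed = [100.0, 100.0, 100.0, 60.0]   # one metric is off by 40%

uniform_score = aggregate_deviation(uniform, reference)
skewed_score = aggregate_deviation(skewed, reference)
```

Both profiles produce the same aggregate even though the second contains a severe failure on one metric, so reporting the individual deviations alongside any single summary number avoids this ambiguity.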

Additional Observations

The paper's connection to continual learning (gradient projection for preventing forgetting) and multi-task optimization (gradient surgery) is appropriate but could be more explicitly discussed. The method essentially adapts techniques from these adjacent fields to the unlearning context, which, while effective, somewhat reduces the algorithmic novelty. The theoretical results, while correct, are relatively standard applications of influence function analysis.

The 10% forget set size is relatively small; real-world requests may involve both very small (single instance) and very large removals. The scalability study partially addresses this but only up to 50%.

Rating: 6.5 / 10

Significance 7 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 20, 2026

Comparison History (26)

vs. Echo: Learning from Experience Data via User-Driven Refinement

gemini-3.1 · 5/22/2026

Paper 2 addresses a critical bottleneck in AI scaling—reliance on expensive static human data—by proposing a scalable framework for continuous learning from real-world user interactions. Its validation in a large-scale production environment demonstrates significant and immediate real-world utility. While Paper 1 provides a strong methodological advance in the important niche of multi-task unlearning, Paper 2 offers a broader impact by providing a blueprint for the continuous, automated improvement of broadly deployed AI agents.

vs. AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a novel, well-defined technical problem (multi-task unlearning) with a concrete methodological contribution, rigorous experimental validation, and strong quantitative results. It tackles the increasingly important area of machine unlearning in realistic multi-task settings, which has direct implications for data privacy regulations (e.g., GDPR). Paper 2 is a review/chapter synthesizing AI in serious games—while useful, it lacks original empirical contributions and covers well-trodden ground. Paper 1's novelty, methodological rigor, and timeliness give it higher impact potential.

vs. The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems

gpt-5.2 · 5/22/2026

Paper 1 targets a timely, high-stakes problem (machine unlearning) with a novel extension to realistic multi-task/shared-backbone settings and proposes concrete, technically grounded methods (task-aware projection + instance-level orthogonalization) backed by quantitative benchmark gains. This combination of methodological rigor, clear evaluation, and direct applicability to privacy/compliance and model editing suggests strong scientific impact. Paper 2 is conceptually interesting for agent infrastructure (auditability, replay, forking) but reads more like a systems/architecture proposal with limited empirical validation and narrower methodological contribution, making near-term scientific impact less certain.

vs. Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

gemini-3.1 · 5/22/2026

Paper 2 addresses a critical flaw in Multimodal Large Language Models—the 'Clever Hans' effect in personality perception—by introducing a novel task, dataset, and benchmark. Its exposure of the 'Prejudice Gap' has broad implications for AI safety, fairness, and human-computer interaction, likely driving significant future research. While Paper 1 provides a strong technical solution for a specific subfield (multi-task unlearning), Paper 2's focus on foundational MLLM evaluation offers wider immediate relevance and broader interdisciplinary impact.

vs. ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

gemini-3.1 · 5/22/2026

Paper 2 addresses machine unlearning in multi-task environments, a crucial and emerging challenge for privacy-compliant AI. By defining full- and partial-task unlearning and mitigating interference, it offers foundational insights applicable across various ML domains. While Paper 1 provides a highly effective solution for video MLLM efficiency, Paper 2's focus on data rights, privacy, and multi-task architectures gives it broader theoretical and societal impact.

vs. Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact: it tackles machine unlearning in realistic multi-task settings, a timely problem with clear safety, privacy, and compliance applications. The interference analysis (task- and instance-level coupling) and the proposed gradient-projection/orthogonalization framework are broadly applicable to many shared-backbone models beyond vision, potentially influencing unlearning, continual learning, and multi-objective optimization. Paper 1 shows strong empirical gains for LLM agent skill hygiene, but it is more domain-specific, relies on a particular agent loop, and may be more sensitive to model/tooling choices, limiting breadth relative to unlearning.

vs. Mind the Sim-to-Real Gap & Think Like a Scientist

gemini-3.1 · 5/21/2026

Paper 1 addresses the sim-to-real gap, a fundamental bottleneck in applying reinforcement learning and sequential decision-making to physical systems. By providing a rigorous theoretical decomposition of value errors and introducing a novel simulation-aided experimental policy (Fisher-SEP), it offers profound methodological advancements. Its broad applicability across fields like operations research, robotics, and public health gives it higher potential for cross-disciplinary impact compared to Paper 2, which focuses on the more specialized, albeit timely, subfield of multi-task machine unlearning.

vs. Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

claude-opus-4.6 · 5/21/2026

Paper 1 addresses a fundamental and technically rigorous problem in machine unlearning for multi-task settings, which is increasingly relevant given data privacy regulations (e.g., GDPR) and the prevalence of shared-backbone multi-task models. It introduces a novel formalization (full-task vs. partial-task unlearning), identifies interference issues, and proposes a principled gradient-based solution with strong empirical results. Paper 2, while valuable for AI education, is primarily a pedagogical contribution with a domain-specific benchmark; its scientific impact is narrower, more niche, and less likely to influence broader ML research directions.

vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

gpt-5.2 · 5/21/2026

Paper 2 likely has higher scientific impact because it introduces a broadly usable, publicly released benchmark and auditing framework for evaluating deep-research capabilities of frontier LMs—an urgent, fast-moving area with wide adoption potential across academia and industry. Its design (cross-source evidence, long-horizon derivation, calibration, provenance disclosure) can shape evaluation standards and enable reproducible comparisons across many model families and applications. Paper 1 is novel and methodologically sound but is narrower (multi-task CV unlearning) and less likely to influence multiple fields at the same scale.

vs. Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

gemini-3.1 · 5/21/2026

Paper 2 addresses machine unlearning in multi-task environments, a critical and rapidly growing area driven by AI privacy, safety, and regulatory compliance. Its technical innovations in gradient projection and orthogonalization offer rigorous, generalizable solutions to fundamental ML challenges. While Paper 1 presents a valuable pedagogical approach and benchmark, Paper 2's methodological rigor and broad applicability across core AI systems give it a higher potential for foundational scientific impact.

vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

claude-opus-4.6 · 5/21/2026

DeepWeb-Bench addresses a timely and high-visibility problem—evaluating frontier AI deep research agents—with a rigorous benchmark that reveals actionable insights (retrieval isn't the bottleneck; derivation and calibration are). It has broad impact across the AI evaluation community, is directly relevant to rapidly advancing commercial products, and provides public resources. Paper 2 makes a solid contribution to multi-task unlearning but addresses a narrower problem with more incremental novelty. The benchmark paper is likely to be widely adopted and cited given the explosive interest in agentic AI systems.

vs. Mind the Sim-to-Real Gap & Think Like a Scientist

gpt-5.2 · 5/21/2026

Paper 2 has higher potential impact due to broader, more foundational contributions: it provides general theoretical decompositions of sim-to-real error and policy value gaps (including an impossibility/limitation under passive learning) and proposes a principled experiment-design algorithm (Fisher-SEP) applicable across many sequential decision domains. Its relevance is timely for deployment of simulator-trained planners in robotics, operations, healthcare, and RL, and the theory-to-practice link via case studies strengthens applicability. Paper 1 is valuable but more specialized to multi-task deep learning unlearning and likely narrower in cross-field reach.

vs. What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental question about how code and structured reasoning signals contribute to LLM capabilities, with experiments at massive scale (10T tokens). Its findings—that structured reasoning traces rather than executable code drive reasoning improvements—have broad implications for foundation model training data composition across the entire field. Paper 1 tackles a more niche problem (multi-task unlearning) with solid but incremental contributions limited to computer vision benchmarks. Paper 2's insights are more timely, broadly applicable, and likely to influence how major labs design pretraining data mixtures.

vs. Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a timely and critical safety concern with Large Reasoning Models (LRMs), a rapidly emerging area. Its novel insight connecting attention patterns to jailbreak vulnerability, combined with an RL-based attack framework, has broader impact across AI safety, alignment research, and policy. The findings affect both open-source and closed-source models, increasing practical relevance. Paper 1, while methodologically sound in extending unlearning to multi-task settings, addresses a more incremental and narrower problem. The AI safety implications and timeliness of Paper 2 give it higher potential impact across the research community.

vs. MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

gpt-5.2 · 5/20/2026

Paper 1 likely has higher scientific impact due to stronger novelty and broader, timely relevance: it extends machine unlearning to realistic multi-task shared-backbone settings and formalizes/mitigates interference via principled gradient projection/orthogonalization, which can generalize beyond vision to many multi-task architectures. It also targets an important regulatory/trustworthy-ML need (data deletion), with clearer implications for compliance and deployment. Paper 2 is timely for LLM agents and offers a useful multi-objective optimizer, but the methodological advance is more incremental (scalarization + annealing) and its impact may be narrower and more system/prompt-engineering dependent.

vs. SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

gemini-3.1 · 5/20/2026

Paper 1 tackles a fundamental challenge in trustworthy AI by extending machine unlearning to multi-task settings, a crucial step for modern shared-backbone models. Its rigorous methodology addressing task and instance-level interference offers broad applicability across AI domains, particularly for privacy compliance. While Paper 2 presents a highly valuable commercial application of VLMs, Paper 1 provides foundational methodological advancements with wider theoretical and cross-disciplinary implications in machine learning.

vs. DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG

claude-opus-4.6 · 5/20/2026

DARE-EEG addresses a fundamental challenge in EEG foundation models—mask-invariance in self-supervised learning—with broad applicability across brain-computer interface tasks. Foundation models for biosignals are a rapidly growing area with significant real-world medical and neuroscience applications. The paper introduces novel dual-alignment pre-training and a practical conv-linear-probing strategy for cross-dataset portability, demonstrating state-of-the-art results across diverse benchmarks. Paper 2 on multi-task unlearning is a solid contribution but addresses a more niche problem with narrower scope. EEG foundation models have greater potential for cross-field impact in neuroscience, healthcare, and AI.

vs. CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

gpt-5.2 · 5/20/2026

Paper 1 is likely to have higher scientific impact due to its methodological novelty and broad relevance: it extends machine unlearning to realistic multi-task shared-backbone settings, identifies concrete interference mechanisms, and proposes a generally applicable optimization framework (task-aware projection + instance-level orthogonalization) with clear quantitative gains. This advances a timely area (unlearning/privacy/compliance) with potential deployment implications across many multi-task domains. Paper 2 provides a valuable evaluation dataset and benchmark for audio-based distress estimation in CBT, but its impact is narrower (mental-health audio evaluation) and more incremental on methodology.

vs. Efficient Elicitation of Collective Disagreements

claude-opus-4.6 · 5/20/2026

Paper 2 addresses the timely and practically important problem of machine unlearning in multi-task settings, which is highly relevant given increasing data privacy regulations (GDPR, right to be forgotten) and the prevalence of multi-task learning in modern AI systems. It introduces a novel problem formulation (multi-task unlearning with full/partial settings), identifies a concrete technical challenge (task/instance-level interference), and provides a principled solution with strong empirical results. Paper 1, while methodologically rigorous and interesting for social choice theory, addresses a more niche problem with narrower potential impact across fields. Paper 2's broader applicability to privacy, AI safety, and practical ML systems gives it higher potential impact.

vs. What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

claude-opus-4.6 · 5/20/2026

Paper 1 addresses the critical and timely challenge of credit assignment in multi-turn LLM agents using RL, a topic at the intersection of two rapidly growing fields (LLM agents and RLHF). Its systematic study of feedback sources and the SERL framework offer practical, broadly applicable insights for training LLM-based agents. Paper 2 tackles a more niche problem (multi-task unlearning) with solid contributions, but the problem scope and community interest are narrower. Paper 1's relevance to the booming LLM agent ecosystem gives it higher potential for citations and real-world adoption.