Yujun Zhou, Kehan Guo, Haomin Zhuang, Xiangqi Wang, Yue Huang, Zhenwen Liang, Pin-Yu Chen, Tian Gao
Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.
The paper identifies and formalizes the "access-compliance gap" — the observation that LLM-based coding agents can retrieve or be presented with user preferences yet still violate them. This is a real and underappreciated problem. The core novelty is TRACE, a pipeline that converts natural-language user corrections into runtime-enforceable checks (deterministic, semantic, or intent-level) that gate task completion. Unlike memory-augmented approaches (Mem0, ReMe, Hindsight) that treat corrections as advisory context, TRACE compiles them into executable verification artifacts with applicability conditions and hook-based enforcement. The five-action lifecycle resolver (Noop, Update, Supersede, Split, New) for managing an evolving rule library is a thoughtful design element that addresses the practical reality of incremental, sometimes contradictory corrections.
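To make the contrast between advisory memory and compiled enforcement concrete, here is a minimal sketch of a rule with an applicability condition and a deterministic check that gates task completion. It is an illustration under stated assumptions, not the TRACE implementation: `CompiledRule`, `gate_completion`, the task dictionary layout, and the example rule are all hypothetical names chosen for this sketch, and the semantic and intent-level checks as well as the lifecycle resolver are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical task record: the file an agent is about to submit.
Task = Dict[str, str]


@dataclass
class CompiledRule:
    """One atomic rule derived from a user correction (illustrative only)."""
    rule_id: str
    text: str                        # the natural-language correction
    applies: Callable[[Task], bool]  # applicability condition
    check: Callable[[Task], bool]    # deterministic check; True means compliant


def gate_completion(rules: List[CompiledRule], task: Task) -> List[str]:
    """Return ids of applicable rules the task violates.

    An empty list means the task may complete. A non-empty list means the
    agent has to revise its output first, rather than merely being reminded.
    """
    return [r.rule_id for r in rules if r.applies(task) and not r.check(task)]


# Assumed example correction: "Use logging instead of print in Python files."
no_print = CompiledRule(
    rule_id="py-no-print",
    text="Use logging instead of print in Python files.",
    applies=lambda task: task["path"].endswith(".py"),
    check=lambda task: "print(" not in task["content"],
)

rules = [no_print]
bad = {"path": "app.py", "content": "print('hi')"}
good = {"path": "app.py", "content": "import logging\nlogging.info('hi')"}
other = {"path": "README.md", "content": "print('hi')"}

print(gate_completion(rules, bad))    # rule applies and is violated
print(gate_completion(rules, good))   # rule applies and passes
print(gate_completion(rules, other))  # rule does not apply to this file
```

In this sketch a rule that does not apply is skipped rather than counted as passed, which mirrors the abstract's reporting of violations over applicable preference checks only.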
The problem is highly practical and timely. As coding agents (Claude Code, Codex CLI, Cursor, etc.) become daily tools for software developers, the inability to learn from repeated corrections is a genuine source of user frustration. TRACE addresses this with a "drop-in skill layer" design philosophy that could integrate with existing agent runtimes.
This paper arrives at a critical moment. The deployment of AI coding assistants is accelerating rapidly, and the personalization problem is a key barrier to long-term adoption. The observation that memory ≠ compliance (an agent that can retrieve a stored preference may still violate it) is timely and practically important. The paper connects to several active research threads: agent memory systems, runtime guardrails, preference learning, and coding agent benchmarks.
The framing as a "representation change" problem (from advisory text to executable constraint) is conceptually clean and could influence how the community thinks about personalization beyond retrieval-augmented approaches.
TRACE addresses a genuine and well-motivated problem with a clean conceptual framework (memory for access, compilation for compliance). The ClawArena results are strong and the system is practical. However, the evaluation is limited by single-user data, a small rule library, and simulated rather than real user interactions. The paper makes a solid contribution to the emerging field of personalized coding agents, though the generalizability of the approach needs further validation across diverse users and preference types.
Generated Jun 12, 2026
Paper 2 (TRACE) addresses a highly practical and timely problem—making LLM coding agents learn from user corrections persistently—with a novel compile-to-runtime-enforcement approach that shows strong empirical results. It has immediate real-world applicability as interactive AI agents become widely deployed, broad impact across HCI and software engineering, and introduces a new paradigm (compiling corrections into runtime checks) distinct from memory-based approaches. Paper 1 provides valuable empirical analysis of on-policy distillation mechanics but is more observational/analytical in nature, with narrower practical implications primarily for model compression practitioners.
Paper 1 introduces a fundamentally novel theoretical framework by bridging dynamical systems stability with neural network quantization. While Paper 2 offers a highly practical systems-level engineering solution for LLM agents, Paper 1 provides a mathematically rigorous, decoupled metric (TQS) that eliminates the need for calibration data. This theoretical innovation has deeper potential scientific impact, offering broad methodological advancements for efficient AI deployment across edge computing, control systems, and time-series forecasting.
Paper 2 addresses the critical bottleneck of high inference costs in large reasoning models. By enabling accurate NVFP4 quantization and significantly improving decoding latency, it offers substantial systemic improvements for deploying advanced AI at scale. While Paper 1 provides a highly practical UX improvement for coding agents, Paper 2's fundamental infrastructure optimization has a broader and more immediate impact across all applications relying on resource-intensive reasoning models.
Paper 2 addresses a broadly relevant problem—making LLM coding agents learn persistently from user corrections—with a practical, open-source solution (TRACE) applicable across the rapidly growing LLM agent ecosystem. Its novelty in compiling natural-language corrections into runtime enforcement checks fills a clear gap (preference compliance vs. access), with strong empirical results and immediate real-world deployability. Paper 1, while methodologically sound, targets a narrower domain (maritime anomaly detection) with a more incremental contribution (rarity-gated FiLM conditioning), limiting its breadth of impact compared to Paper 2's timeliness and cross-field relevance.
Paper 2 likely has higher scientific impact due to a broadly relevant methodological contribution to online RL for flow-based generative models: replacing PPO ratio clipping with an exactly computable KL-based proximal constraint leveraging Gaussian per-step policies. This addresses a known structural mismatch, improves stability (multi-epoch training), mitigates forgetting, and supports multi-objective optimization—advances applicable across image/video generation and potentially other diffusion/flow frameworks. Paper 1 is practically valuable for coding-agent UX, but its impact is more application-layer and narrower in scope compared to a generally reusable optimization/training improvement for generative modeling.
Paper 1 addresses a fundamental methodological issue in de novo peptide sequencing—a core proteomics problem—with rigorous theoretical analysis (mutual information restoration) and strong empirical results (up to 39.1% improvement). Its training-free, plug-and-play nature makes it broadly applicable to existing Transformer-based models. Paper 2 presents a useful engineering contribution for coding agents but addresses a narrower usability concern with less fundamental scientific depth. Paper 1's impact spans computational biology and machine learning, offering deeper methodological insights with broader scientific implications.
Paper 1 addresses a fundamental architectural component of large language models (MoE routers) with a mathematically rigorous redesign. Its theoretical grounding combined with large-scale empirical validation (up to 11B parameters) suggests it could broadly influence foundational model architectures. While Paper 2 offers a practical system for AI agents, Paper 1's contribution is more fundamental, potentially improving the efficiency and performance of a wide range of future foundation models, leading to greater scientific and practical impact.
Paper 2 likely has higher scientific impact: it addresses a foundational, widely relevant issue in mechanistic interpretability—seed dependence and reproducibility of SAE features—using a scalable per-feature stability metric, extensive cross-condition experiments, geometric/subspace framing, and a synthetic model to establish mechanism. Its findings inform how SAEs should be evaluated, compared, and aggregated across runs, influencing interpretability, representation learning, and reliability/benchmarking practices across many labs. Paper 1 is practically valuable for coding agents, but is more application-specific and may generalize less broadly than Paper 2’s conceptual and methodological contribution.
Paper 2 likely has higher impact due to strong real-world applicability and timeliness: it targets a pervasive deployment pain point (agents repeatedly violating user corrections across sessions) with a drop-in runtime enforcement mechanism. The approach is practical, measurable, and broadly relevant to coding agents, preference alignment, human-in-the-loop systems, and safety/compliance tooling, with open-source artifacts aiding adoption. Paper 1 is novel and methodologically interesting for RL-trainable latent reasoning plus interpretability, but its impact is more specialized to latent-reasoning model design and mechanistic analysis, with less immediate deployment pull.
Paper 1 offers a foundational paradigm shift by formulating neural model editing as a reinforcement learning problem. This provides a generalized methodology for critical challenges like machine unlearning and bias mitigation across multiple modalities. While Paper 2 presents a highly practical and timely engineering solution for LLM agent memory compliance, Paper 1's algorithmic innovation has broader implications for deep learning theory, model safety, and alignment, giving it a higher potential for widespread scientific impact and foundational follow-up research.