Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

Yujun Zhou, Kehan Guo, Haomin Zhuang, Xiangqi Wang, Yue Huang, Zhenwen Liang, Pin-Yu Chen, Tian Gao

Jun 11, 2026 · arXiv:2606.13174v1

cs.LG · cs.CL

#2870 of 5669 · cs.LG

Tournament Score: 1400±46

Win Rate: 53%

Rating: 6.5 / 10

Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Abstract

Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: TRACE — Compiling User Corrections into Runtime Enforcement for Coding Agents

1. Core Contribution

The paper identifies and formalizes the "access-compliance gap" — the observation that LLM-based coding agents can retrieve or be presented with user preferences yet still violate them. This is a real and underappreciated problem. The core novelty is TRACE, a pipeline that converts natural-language user corrections into runtime-enforceable checks (deterministic, semantic, or intent-level) that gate task completion. Unlike memory-augmented approaches (Mem0, ReMe, Hindsight) that treat corrections as advisory context, TRACE compiles them into executable verification artifacts with applicability conditions and hook-based enforcement. The five-action lifecycle resolver (Noop, Update, Supersede, Split, New) for managing an evolving rule library is a thoughtful design element that addresses the practical reality of incremental, sometimes contradictory corrections.
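The five resolver actions named above can be pictured as operations on a rule library. The following is a minimal sketch, not the paper's implementation: the five action names come from the paper, but the semantics assigned to each one here, the `Rule` fields, and the `apply_action` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Action names are from the paper; the behaviors below are assumed.
    NOOP = "noop"            # correction already covered; library unchanged
    UPDATE = "update"        # refine an existing rule in place
    SUPERSEDE = "supersede"  # retire an old rule and add its replacement
    SPLIT = "split"          # replace one compound rule with atomic rules
    NEW = "new"              # add a rule with no existing counterpart


@dataclass
class Rule:
    rule_id: str
    text: str
    active: bool = True


def apply_action(library, action, target_id=None, new_rules=()):
    """Return a new library (dict of rule_id -> Rule) after one resolver decision."""
    lib = dict(library)  # shallow copy so the caller's library is not mutated
    if action is Action.NOOP:
        return lib
    if action is Action.UPDATE:
        # Keep the existing id, swap in the refined rule text.
        lib[target_id] = Rule(target_id, new_rules[0].text)
        return lib
    if action in (Action.SUPERSEDE, Action.SPLIT):
        # Deactivate the old rule rather than deleting it, to keep history.
        old = lib[target_id]
        lib[target_id] = Rule(old.rule_id, old.text, active=False)
    for rule in new_rules:
        lib[rule.rule_id] = rule
    return lib
```

Under these assumptions, a contradictory correction maps to `SUPERSEDE`: the old rule stays in the library but is marked inactive, so later checks skip it while the history of what the user once asked for is preserved.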

2. Methodological Rigor

Strengths in experimental design:

The paper evaluates across two complementary benchmarks (ClawArena for coding tasks, MemoryArena for memory-intensive tasks) with both in-distribution and out-of-distribution splits, providing a reasonably comprehensive picture.

The frozen-state evaluation protocol — where the target preference is removed from the task prompt and no new corrections are collected during testing — is a clean way to isolate the effect of stored representations.

Multiple baselines (No Memory, Mem0, Hindsight, ReMe-Light) are compared, and the paper honestly reports cases where TRACE's advantage is marginal (MemoryArena OOD).
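The violation figures quoted throughout this assessment are rates over applicable preference checks. A minimal sketch of such a metric follows, assuming each check is recorded as an `(applicable, passed)` pair; the function name and data layout are hypothetical and are not taken from the paper's evaluation code.

```python
def violation_rate(results):
    """Fraction of applicable preference checks that failed.

    results: iterable of (applicable, passed) boolean pairs, one per check.
    Returns None when no check applies, since the rate is then undefined.
    """
    applicable = [passed for is_applicable, passed in results if is_applicable]
    if not applicable:
        return None
    failed = sum(1 for passed in applicable if not passed)
    return failed / len(applicable)
```

Excluding non-applicable checks from the denominator matters here: a rule that never fires on a held-out task should neither count as a violation nor dilute the rate.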

Concerns:

The diagnostic benchmark (Section 3) is derived from a single user's transcripts (32 transcripts, 19 held-out tasks, 29 preference checks). This is acknowledged but significantly limits generalizability claims. The correction patterns of one AI researcher may not represent diverse user populations.

The simulated user-in-the-loop protocol, while validated with reasonable fidelity metrics (F1=0.906), introduces a synthetic feedback loop. Rule recall of only 0.668 means the simulator misses about a third of the underlying preferences — a non-trivial gap that could affect training-phase rule acquisition quality.

The 47-rule library is relatively small. Scalability to hundreds or thousands of rules, potential rule conflicts, and performance degradation under library growth are not explored.

The paper uses Gemma 4 31B for detection, extraction, and compilation throughout the pipeline. The sensitivity of results to this choice is not ablated.

3. Potential Impact

The problem is highly practical and timely. As coding agents (Claude Code, Codex CLI, Cursor, etc.) become daily tools for software developers, the inability to learn from repeated corrections is a genuine source of user frustration. TRACE addresses this with a "drop-in skill layer" design philosophy that could integrate with existing agent runtimes.

Real-world applicability:

The approach is deployed as skills for Claude Code and Codex CLI, with code publicly available, making it immediately usable.

The deterministic enforcement tier (regex-based checks, file system inspections) is robust and predictable for many coding conventions.

The concept of compiling preferences into executable constraints could extend beyond coding to other agentic domains (document editing, system administration, data analysis).
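To illustrate what a deterministic, regex-based check with an applicability condition might look like, here is a small sketch. It assumes a rule compiles to a forbidden pattern plus a predicate on the task description; `CompiledCheck`, `gate_completion`, and the example rule are hypothetical names and are not drawn from the TRACE codebase.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class CompiledCheck:
    name: str
    applies: Callable[[str], bool]  # applicability condition on the task text
    forbidden: re.Pattern           # pattern that must not appear in the output


def gate_completion(task: str, output: str, checks: list[CompiledCheck]) -> list[str]:
    """Return names of violated checks; an empty list means the task may complete."""
    violations = []
    for check in checks:
        # Only checks whose applicability condition holds can be violated.
        if check.applies(task) and check.forbidden.search(output):
            violations.append(check.name)
    return violations


# Hypothetical rule derived from a correction such as "stop leaving print() calls".
no_print = CompiledCheck(
    name="no-print-debugging",
    applies=lambda task: "python" in task.lower(),
    forbidden=re.compile(r"^\s*print\(", re.MULTILINE),
)
```

A runtime hook could call `gate_completion` before the agent reports a task as done and send any returned rule names back to the agent for repair. That is the sense in which a check gates completion rather than merely advising.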

Limitations on impact:

The semantic and intent-level enforcement tiers are less well-characterized. The paper notes the semantic tier "was not required by any rule in this snapshot," so its effectiveness is untested.

The system currently handles preferences that manifest as observable workspace states or tool-call patterns. Subtler preferences (code style, explanation verbosity, reasoning approach) may be harder to compile into deterministic checks.

The MemoryArena OOD results (97% violation rate for TRACE vs. 99-100% for baselines) suggest the approach struggles significantly when encountering truly novel constraint types.

4. Timeliness & Relevance

This paper arrives at a critical moment. The deployment of AI coding assistants is accelerating rapidly, and the personalization problem is a key barrier to long-term adoption. The observation that memory ≠ compliance is timely and practically important. The paper connects to several active research threads: agent memory systems, runtime guardrails, preference learning, and coding agent benchmarks.

The framing as a "representation change" problem (from advisory text to executable constraint) is conceptually clean and could influence how the community thinks about personalization beyond retrieval-augmented approaches.

5. Strengths & Limitations

Key Strengths:

Clear problem formulation with empirical evidence of the access-compliance gap

Strong ClawArena results, especially OOD (100% → 2.0% violation) — a compelling demonstration

Practical system design with public code and deployable skills

The lifecycle resolver addresses a real engineering challenge of managing evolving rule libraries

Honest reporting of weaker results (MemoryArena OOD)

Notable Limitations:

Single-user diagnostic corpus limits generalizability

The rule library is small (47 entries); scalability is untested

Semantic enforcement tier is essentially unvalidated

The approach requires corrections to be compilable into verifiable conditions — coverage of the full space of user preferences is unclear

The simulated user proxy, while validated, may not capture the diversity and ambiguity of real user corrections

No ablation on the pipeline's components (detection accuracy, compilation quality, resolver accuracy independently)

The MemoryArena ID violation rate of 60.5% is still quite high, suggesting substantial room for improvement even in-distribution

Overall Assessment

TRACE addresses a genuine and well-motivated problem with a clean conceptual framework (memory for access, compilation for compliance). The ClawArena results are strong and the system is practical. However, the evaluation is limited by single-user data, a small rule library, and simulated rather than real user interactions. The paper makes a solid contribution to the emerging field of personalized coding agents, though the generalizability of the approach needs further validation across diverse users and preference types.

Rating: 6.5 / 10

Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Generated Jun 12, 2026

Comparison History (17)

Won vs. Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Paper 2 (TRACE) addresses a highly practical and timely problem—making LLM coding agents learn from user corrections persistently—with a novel compile-to-runtime-enforcement approach that shows strong empirical results. It has immediate real-world applicability as interactive AI agents become widely deployed, broad impact across HCI and software engineering, and introduces a new paradigm (compiling corrections into runtime checks) distinct from memory-based approaches. Paper 1 provides valuable empirical analysis of on-policy distillation mechanics but is more observational/analytical in nature, with narrower practical implications primarily for model compression practitioners.

claude-opus-4-6·Jun 12, 2026

Lost vs. Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score

Paper 1 introduces a fundamentally novel theoretical framework by bridging dynamical systems stability with neural network quantization. While Paper 2 offers a highly practical systems-level engineering solution for LLM agents, Paper 1 provides a mathematically rigorous, decoupled metric (TQS) that eliminates the need for calibration data. This theoretical innovation has deeper potential scientific impact, offering broad methodological advancements for efficient AI deployment across edge computing, control systems, and time-series forecasting.

gemini-3.1-pro-preview·Jun 12, 2026

Lost vs. ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Paper 2 addresses the critical bottleneck of high inference costs in large reasoning models. By enabling accurate NVFP4 quantization and significantly improving decoding latency, it offers substantial systemic improvements for deploying advanced AI at scale. While Paper 1 provides a highly practical UX improvement for coding agents, Paper 2's fundamental infrastructure optimization has a broader and more immediate impact across all applications relying on resource-intensive reasoning models.

gemini-3.1-pro-preview·Jun 12, 2026

Won vs. Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

Paper 2 addresses a broadly relevant problem—making LLM coding agents learn persistently from user corrections—with a practical, open-source solution (TRACE) applicable across the rapidly growing LLM agent ecosystem. Its novelty in compiling natural-language corrections into runtime enforcement checks fills a clear gap (preference compliance vs. access), with strong empirical results and immediate real-world deployability. Paper 1, while methodologically sound, targets a narrower domain (maritime anomaly detection) with a more incremental contribution (rarity-gated FiLM conditioning), limiting its breadth of impact compared to Paper 2's timeliness and cross-field relevance.

claude-opus-4-6·Jun 12, 2026

Lost vs. Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Paper 2 likely has higher scientific impact due to a broadly relevant methodological contribution to online RL for flow-based generative models: replacing PPO ratio clipping with an exactly computable KL-based proximal constraint leveraging Gaussian per-step policies. This addresses a known structural mismatch, improves stability (multi-epoch training), mitigates forgetting, and supports multi-objective optimization—advances applicable across image/video generation and potentially other diffusion/flow frameworks. Paper 1 is practically valuable for coding-agent UX, but its impact is more application-layer and narrower in scope compared to a generally reusable optimization/training improvement for generative modeling.

gpt-5.2·Jun 12, 2026

Lost vs. MemNovo: Look Back at the Spectrum for Balanced De Novo Peptide Sequencing from Mass Spectrometry

Paper 1 addresses a fundamental methodological issue in de novo peptide sequencing—a core proteomics problem—with rigorous theoretical analysis (mutual information restoration) and strong empirical results (up to 39.1% improvement). Its training-free, plug-and-play nature makes it broadly applicable to existing Transformer-based models. Paper 2 presents a useful engineering contribution for coding agents but addresses a narrower usability concern with less fundamental scientific depth. Paper 1's impact spans computational biology and machine learning, offering deeper methodological insights with broader scientific implications.

claude-opus-4-6·Jun 12, 2026

Lost vs. Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Paper 1 addresses a fundamental architectural component of large language models (MoE routers) with a mathematically rigorous redesign. Its theoretical grounding combined with large-scale empirical validation (up to 11B parameters) suggests it could broadly influence foundational model architectures. While Paper 2 offers a practical system for AI agents, Paper 1's contribution is more fundamental, potentially improving the efficiency and performance of a wide range of future foundation models, leading to greater scientific and practical impact.

gemini-3.1-pro-preview·Jun 12, 2026

Lost vs. Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Paper 2 likely has higher scientific impact: it addresses a foundational, widely relevant issue in mechanistic interpretability—seed dependence and reproducibility of SAE features—using a scalable per-feature stability metric, extensive cross-condition experiments, geometric/subspace framing, and a synthetic model to establish mechanism. Its findings inform how SAEs should be evaluated, compared, and aggregated across runs, influencing interpretability, representation learning, and reliability/benchmarking practices across many labs. Paper 1 is practically valuable for coding agents, but is more application-specific and may generalize less broadly than Paper 2’s conceptual and methodological contribution.

gpt-5.2·Jun 12, 2026

Won vs. Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Paper 2 likely has higher impact due to strong real-world applicability and timeliness: it targets a pervasive deployment pain point (agents repeatedly violating user corrections across sessions) with a drop-in runtime enforcement mechanism. The approach is practical, measurable, and broadly relevant to coding agents, preference alignment, human-in-the-loop systems, and safety/compliance tooling, with open-source artifacts aiding adoption. Paper 1 is novel and methodologically interesting for RL-trainable latent reasoning plus interpretability, but its impact is more specialized to latent-reasoning model design and mechanistic analysis, with less immediate deployment pull.

gpt-5.2·Jun 12, 2026

Lost vs. Reinforcement Learning for Neural Model Editing

Paper 1 offers a foundational paradigm shift by formulating neural model editing as a reinforcement learning problem. This provides a generalized methodology for critical challenges like machine unlearning and bias mitigation across multiple modalities. While Paper 2 presents a highly practical and timely engineering solution for LLM agent memory compliance, Paper 1's algorithmic innovation has broader implications for deep learning theory, model safety, and alignment, giving it a higher potential for widespread scientific impact and foundational follow-up research.

gemini-3.1-pro-preview·Jun 12, 2026
