Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments
Yuxin Chen, Xiaodong Cai, Junfeng Fang, Zhuowen Han, Yu Wang, Yaorui Shi, Yi Zhang, Qi Gu
Abstract
Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well-controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into the agent learning process. We identify two major sources of interaction noise in real-world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce such perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results within the training environment. To stabilize training while encouraging agents to handle increasingly challenging imperfections, noise is applied to only a subset of rollouts and progressively increased in difficulty as the model adapts to the current noise level. Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments. Our analysis reveals that training under noise conditions also yields performance gains on idealized benchmarks, suggesting that controlled exposure to environmental noise promotes more generalizable reasoning and decision-making behaviors. Our findings highlight the importance of modeling interaction imperfections for bridging the gap between agent training and real-world deployment.
AI Impact Assessments
Scientific Impact Assessment: "Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments"
1. Core Contribution
NoisyAgent proposes an agentic reinforcement learning framework that explicitly injects structured noise into the training pipeline of LLM-based agents to bridge the gap between idealized training environments and noisy real-world deployment. The key insight is that current agent training paradigms assume clean user instructions and reliable tool execution, while real deployments involve ambiguous users, inconsistent instructions, and unreliable tool outputs.
The framework contributes three interrelated components: (1) an automatic noise injection pipeline targeting user-side noise (ambiguous, inconsistent, redundant instructions) and tool-side noise (failures, incomplete outputs, misleading responses, redundancy); (2) a hybrid training scheme mixing clean and noisy rollouts with group-wise advantage normalization; and (3) an adaptive noise scheduler that progressively increases perturbation difficulty based on the model's measured robustness gap between clean and noisy rollouts.
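To make components (2) and (3) concrete, here is a minimal sketch of how a scheduler might mark a subset of rollouts as noisy and raise the difficulty level once the clean-versus-noisy reward gap shrinks. It is an illustration under stated assumptions, not the paper's implementation: the class name `NoiseScheduler`, the discrete levels, the `gap_threshold` rule, and the default `noisy_fraction` are all hypothetical.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class NoiseScheduler:
    """Hypothetical scheduler; every field and default here is an assumption."""
    max_level: int = 3            # assumed number of discrete difficulty levels
    gap_threshold: float = 0.05   # assumed: raise the level once the gap is this small
    noisy_fraction: float = 0.5   # assumed share of rollouts that receive noise
    level: int = 1

    def assign_noise(self, n_rollouts: int, rng: random.Random) -> list[bool]:
        """Mark a random subset of rollouts as noisy; the rest stay clean."""
        n_noisy = round(n_rollouts * self.noisy_fraction)
        noisy_ids = set(rng.sample(range(n_rollouts), n_noisy))
        return [i in noisy_ids for i in range(n_rollouts)]

    def update(self, clean_rewards: list[float], noisy_rewards: list[float]) -> int:
        """Raise the noise level when the robustness gap is small enough."""
        if not clean_rewards or not noisy_rewards:
            return self.level  # nothing to compare, keep the current level
        gap = mean(clean_rewards) - mean(noisy_rewards)
        if gap <= self.gap_threshold and self.level < self.max_level:
            self.level += 1
        return self.level
```

In this sketch the level only ever increases, and it stays put while the model still loses substantial reward under noise. Whether the paper uses a threshold, a smoothed estimate, or some other rule is not stated in this assessment.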
2. Methodological Rigor
The methodology is well-structured and technically sound in its design. The formalization within a POMDP framework is appropriate, and the integration with GRPO-based optimization is clean. The group-wise advantage normalization (Equation 9) separating clean and noisy rollouts is a sensible design choice to prevent reward signal contamination.
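The idea behind normalizing advantages separately for clean and noisy rollouts can be sketched as follows. The exact form of Equation 9 is not reproduced in this assessment, so the function below is an assumption: a GRPO-style standardization (reward minus group mean, divided by group standard deviation) applied independently to the clean and noisy subsets of one prompt's rollout group. The name `groupwise_advantages` and the `eps` constant are hypothetical.

```python
from statistics import mean, pstdev


def groupwise_advantages(rewards, is_noisy, eps=1e-6):
    """Standardize rewards within the clean subset and the noisy subset separately.

    Assumed form, not the paper's Equation 9: A_i = (r_i - mean_g) / (std_g + eps),
    where g is the subset (clean or noisy) that rollout i belongs to.
    """
    advantages = [0.0] * len(rewards)
    for flag in (False, True):
        idx = [i for i, noisy in enumerate(is_noisy) if noisy == flag]
        if not idx:
            continue  # this subset is empty for the current prompt
        group = [rewards[i] for i in idx]
        mu = mean(group)
        sigma = pstdev(group)  # population std; 0.0 for a single rollout
        for i in idx:
            advantages[i] = (rewards[i] - mu) / (sigma + eps)
    return advantages
```

With a single shared baseline, noisy rollouts that score lower simply because their task was harder would receive systematically negative advantages. Standardizing within each subset compares a noisy rollout only against other noisy rollouts, which is presumably the contamination the separation is meant to prevent.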
However, several concerns arise:
3. Potential Impact
The paper addresses a genuinely important problem: real-world LLM agents face noisy, imperfect environments that current training paradigms ignore. The practical relevance is high, as tool-calling failures, ambiguous user queries, and incomplete API responses are ubiquitous in production systems.
The finding that noise-aware training also improves performance on clean benchmarks (Table 2) is particularly noteworthy and practically valuable—it suggests this is not merely a robustness technique but a general training improvement, akin to data augmentation benefits observed in other domains.
The framework is relatively general and could be applied to various agentic training pipelines. The noise taxonomy (user-side and tool-side) provides a useful conceptual framework for thinking about deployment-time failures.
However, the impact is somewhat constrained by: (a) reliance on a specific environment scaling pipeline [34], limiting immediate reproducibility; (b) the high computational cost (32-64 H800 GPUs for 3-5 days); and (c) the fact that the core idea—training with noise improves robustness—is well-established in classical RL (domain randomization, adversarial training), and the paper's novelty lies primarily in applying and adapting this principle to LLM agent training.
4. Timeliness & Relevance
This paper is highly timely. LLM agents are being deployed rapidly in production (customer service, coding assistants, search), and robustness under imperfect conditions is a recognized bottleneck. The identification of the train-deploy gap for agentic systems is well-motivated and resonates with practical challenges.
The work sits at the intersection of two active research areas: agentic RL training (DeepSeek-R1, GRPO/DAPO/GSPO) and agent robustness evaluation. By connecting these, it fills a genuine gap: most prior work either studies training methods in clean settings or evaluates robustness without proposing training solutions.
5. Strengths & Limitations
Strengths:
Limitations:
Additional Observations
The ablation study (Table 3) is conducted on only one domain (Delivery), which limits the generalizability of its component-level conclusions. The training dynamics analysis (Figure 2) is informative but is shown for only one setting. The paper would benefit from an analysis of failure modes: when does noise-aware training fail or provide diminishing returns?
The broader framing as a "fundamental gap" is somewhat overstated given that this is essentially curriculum-based domain randomization applied to LLM agent training, but the execution is solid and the results are convincing within their scope.
Generated May 27, 2026
Comparison History (17)
Paper 2 likely has higher impact due to broader applicability and timeliness: robustness to user/tool noise is a pervasive real-world deployment bottleneck for LLM agents, and a training framework that improves both noisy-setting performance and even ideal benchmarks can influence many agentic systems. Its approach is more directly actionable for improving deployed agents across domains (customer support, automation, tool APIs) and aligns with current focus on reliability. Paper 1 is novel and methodologically solid for evaluation/benchmarking feasibility awareness, but its primary impact is narrower (diagnostics/metrics) compared to a general training paradigm.
Paper 2 is more novel and broadly impactful: it introduces a principled, span-level input-induced uncertainty decomposition using Shapley values with exact additivity, enabling actionable clarification guidance. This directly targets safety/trust in high-stakes LLM use (timely, high real-world relevance) and is applicable across domains (clinical, QA, dialogue, decision support). The methodology is grounded in information theory and cooperative game theory and is evaluated on multiple benchmarks plus a high-stakes setting. Paper 1 is practically useful for agent robustness, but the idea of training with noise/curricula is more incremental and less generalizable across fields.
Paper 1 addresses a critical bottleneck in deploying LLMs in healthcare: governance, safety, and auditable reasoning. Its rigorous methodology, involving a traceable clinician-audited pipeline and direct comparison with medical residents, sets a strong precedent for safe clinical AI. While Paper 2 offers a valuable approach to general agent robustness, Paper 1 has higher potential for profound real-world impact by providing a viable pathway for integrating LLMs into high-stakes medical environments.
Paper 2 addresses a fundamental and broadly applicable challenge—improving LLM agent robustness under real-world noisy conditions—which impacts the rapidly growing field of LLM-based agents across numerous domains. The finding that noise-augmented training also improves performance on clean benchmarks suggests deep insights about generalization. Paper 1, while methodologically solid for battery degradation forecasting, addresses a narrower application domain. Paper 2's framework (NoisyAgent) has broader cross-field impact potential given the ubiquity of LLM agent deployment, greater timeliness given the current explosion of agentic AI research, and more generalizable contributions.
Paper 1 likely has higher scientific impact because it proposes a general training framework (NoisyAgent) that addresses a core deployment gap for LLM agents—robustness to stochastic user/tool failures—applicable across many real-world agent settings (tool use, assistants, automation). It is method-oriented and potentially extensible as a standard training paradigm, affecting multiple domains beyond evaluation. Paper 2 is timely and useful as a benchmark for education-focused multimodal reasoning, but its impact is more field-specific and primarily evaluative rather than introducing broadly reusable methods.
Paper 1 addresses a fundamental and broadly applicable problem—bridging the gap between idealized training and noisy real-world deployment for LLM agents. Its framework (NoisyAgent) is domain-agnostic and applicable across diverse agent tasks, giving it broader impact potential. The finding that noise-augmented training also improves performance on clean benchmarks suggests a generalizable principle. Paper 2, while valuable for clinical AI, targets a narrower domain (medical guideline reasoning) with a more incremental contribution (structured data augmentation from CPGs). Paper 1's timeliness is also higher given the rapid proliferation of LLM agents.
Paper 1 has higher likely impact due to timeliness (LLM agents), broad applicability (robustness to user/tool noise affects many deployed agent systems), and strong real-world relevance. Introducing structured noise curricula into agent training is a generally transferable idea across domains and tool-augmented settings, and the reported gains both under noise and on standard benchmarks suggest practical uptake. Paper 2 offers a thoughtful conceptual decomposition and analysis of PPO failure modes, but its demonstrated scope is narrower (specific cumulative-damage settings with calibrated career-style environments) and may influence a smaller slice of RL practice.
Paper 1 addresses a fundamental and broadly applicable problem—bridging the gap between idealized training environments and noisy real-world deployment for LLM agents. Its framework (NoisyAgent) is generalizable across many agent applications, introduces a systematic noise injection methodology with curriculum-style training, and demonstrates improvements even on clean benchmarks, suggesting deep generalization benefits. Paper 2, while novel in its Gumbel-noise-based counterfactual generation for educational writing, targets a narrower application domain with more limited cross-field impact. Paper 1's contributions are more timely given the rapid deployment of LLM agents in real-world settings.
While Paper 1 offers a practical and experimentally validated approach to improving LLM agent robustness via noise injection, Paper 2 introduces a paradigm-shifting systems architecture for spatial AI. By conceptualizing 3D edits as 'SpatialTransactions' governed by operability contracts, Paper 2 addresses a critical bottleneck bridging generative AI with robotics, CAD, and manufacturing. This fundamental reframing of spatial intelligence from visual plausibility to functional operability offers a higher ceiling for long-term, cross-disciplinary scientific impact.
Paper 2 addresses a critical bottleneck in the widespread deployment of LLM agents: robustness to real-world noise and stochasticity. While Paper 1 provides a valuable benchmark for the specific domain of audio-video generation, Paper 2's methodology for enhancing agent resilience to user and tool noise has broader applicability across numerous fields. Improving generalization and decision-making in imperfect environments bridges a fundamental gap between idealized training and practical AI deployment, offering higher potential for widespread cross-disciplinary impact.
Paper 2 introduces a highly ambitious benchmark that defines a new frontier for LLM agents: always-on, cross-device personal assistance with long-horizon context. By expanding the evaluation scope to include proactive assistance and rich contextual noise, it is likely to drive future research directions and serve as a standard for next-generation agents, offering broader field impact than the robust training methodology proposed in Paper 1.
Paper 1 addresses a fundamental and broadly impactful problem in AI agent deployment—robustness to real-world noise—proposing a novel training framework (NoisyAgent) with progressive noise injection. This has wide applicability across the rapidly growing LLM agent ecosystem. Paper 2, while clinically valuable, is primarily a correlational/validation study of an existing AI-ECG model against echocardiographic measures, offering incremental interpretability insights rather than a new methodology. Paper 1's novelty, broader applicability across AI/ML fields, and timeliness in the booming agent space give it higher potential impact.
Paper 1 targets a broadly important, timely problem: robustness of LLM-based agents under realistic interaction noise (user ambiguity, tool failures). The approach is conceptually general and likely transferable across many agentic systems and application domains, with clear real-world deployment relevance. If validated rigorously, training with progressively increased stochastic perturbations could influence standard agent training/evaluation practices beyond a single task family. Paper 2 is methodologically solid and impactful within imperfect-information game AI, but its scope and cross-domain applicability are narrower than robustness training for LLM agents.
Paper 2 has higher potential impact because it identifies a pervasive, systemic bias in widely used production benchmarking setups and provides both a theoretical (M/G/1 queuing + GIL effects) and practical mitigation (multi-process framework) with a standardized composite metric (NTPOT). This directly affects reproducibility and correctness of LLM performance claims across industry and academia, with immediate real-world applicability to SLO-driven deployments and broad relevance to systems/ML evaluation. Paper 1 is valuable but is a more incremental extension of robustness training via noise injection, with narrower cross-field reach.
Paper 1 identifies a fundamental, previously unnamed problem ('attribution blind spot') in RAG systems critical for AI safety and trustworthiness. It introduces a novel framework (CRM) grounded in cognitive science, with rigorous methodology across 9 model variants and 3 families. The problem it addresses—verifying whether LLM outputs are actually grounded in retrieved evidence—is essential for high-stakes AI deployment and has broad implications for interpretability, factuality, and AI governance. Paper 2 addresses a practical but more incremental contribution (adding noise to agent training), which is a well-explored paradigm in ML robustness.
Paper 2 addresses a fundamental flaw in how the AI community evaluates multi-hop reasoning, identifying 'composition collapse' and proposing a novel diagnostic protocol. While Paper 1 offers a practical training framework for agent robustness, Paper 2's methodological insights are likely to shift evaluation paradigms across the broader field of LLM reasoning, leading to a deeper and wider scientific impact.
Paper 1 addresses a critical and timely problem—bridging the gap between idealized training and real-world deployment of LLM-based agents—which affects the rapidly growing field of AI agents. Its NoisyAgent framework is broadly applicable across diverse agent tasks and demonstrates improved robustness and generalization. Paper 2 presents a niche but well-crafted contribution to combinatorial counting via WFOMC, but its impact is narrower, targeting a specialized community. Paper 1's relevance to the dominant LLM-agent paradigm, practical applicability, and potential to influence training methodologies across the field give it higher estimated impact.