Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

Yuxin Chen, Xiaodong Cai, Junfeng Fang, Zhuowen Han, Yu Wang, Yaorui Shi, Yi Zhang, Qi Gu

#1395 of 2682 · Artificial Intelligence
Tournament Score: 1405±43
Win Rate: 59% (10 wins, 7 losses, 17 matches)
Rating: 5.8/10

Abstract

Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well-controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into the agent learning process. We identify two major sources of interaction noise in real-world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce such perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results within the training environment. To stabilize training while encouraging agents to handle increasingly challenging imperfections, noise is applied to only a subset of rollouts and progressively increased in difficulty as the model adapts to the current noise level. Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments. Our analysis reveals that training under noise conditions also yields performance gains on idealized benchmarks, suggesting that controlled exposure to environmental noise promotes more generalizable reasoning and decision-making behaviors. Our findings highlight the importance of modeling interaction imperfections for bridging the gap between agent training and real-world deployment.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments"

1. Core Contribution

NoisyAgent proposes an agentic reinforcement learning framework that explicitly injects structured noise into the training pipeline of LLM-based agents to bridge the gap between idealized training environments and noisy real-world deployment. The key insight is that current agent training paradigms assume clean user instructions and reliable tool execution, while real deployments involve ambiguous users, inconsistent instructions, and unreliable tool outputs.

The framework contributes three interrelated components: (1) an automatic noise injection pipeline targeting user-side noise (ambiguous, inconsistent, redundant instructions) and tool-side noise (failures, incomplete outputs, misleading responses, redundancy); (2) a hybrid training scheme mixing clean and noisy rollouts with group-wise advantage normalization; and (3) an adaptive noise scheduler that progressively increases perturbation difficulty based on the model's measured robustness gap between clean and noisy rollouts.
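The third component can be illustrated with a minimal sketch. The review only states that difficulty rises based on the robustness gap between clean and noisy rollouts, so the update rule below is an assumption, as are the class name `NoiseScheduler`, the discrete levels, and `max_level`. The default threshold of 0.05 is the Δ mentioned later in this review; how the paper actually uses it may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class NoiseScheduler:
    """Hypothetical curriculum over discrete noise levels (0 = mildest)."""
    max_level: int = 3        # assumed number of levels, not from the paper
    threshold: float = 0.05   # the review's Delta; its exact role is assumed
    level: int = 0

    def update(self, clean_rewards, noisy_rewards):
        # Robustness gap: how much worse the policy scores under noise.
        gap = mean(clean_rewards) - mean(noisy_rewards)
        # Assumed rule: a small gap means the model has adapted to the
        # current level, so harder perturbations are unlocked.
        if gap < self.threshold and self.level < self.max_level:
            self.level += 1
        return self.level
```

Under this assumed rule, a scheduler fed clean rewards `[1, 1, 0, 1]` and noisy rewards `[0, 0, 0, 1]` stays at its current level (gap 0.5), and advances once the noisy rewards match the clean ones.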

2. Methodological Rigor

The methodology is well-structured and technically sound in its design. The formalization within a POMDP framework is appropriate, and the integration with GRPO-based optimization is clean. The group-wise advantage normalization (Equation 9) separating clean and noisy rollouts is a sensible design choice to prevent reward signal contamination.
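To make the normalization point concrete, here is a rough sketch of what separating clean and noisy rollouts could look like. It is not a reproduction of the paper's Equation 9, which this review does not quote; it assumes the usual GRPO-style standardization (reward minus group mean, divided by group standard deviation) applied within each subgroup. The function names and the `eps` term are illustrative.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-6):
    """GRPO-style standardization: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def groupwise_advantages(rewards, is_noisy):
    """Normalize the clean and noisy rollouts of one prompt separately."""
    adv = [0.0] * len(rewards)
    for flag in (False, True):
        # Indices of the rollouts belonging to this subgroup.
        idx = [i for i, noisy in enumerate(is_noisy) if noisy == flag]
        if not idx:
            continue
        scaled = normalize([rewards[j] for j in idx])
        for i, a in zip(idx, scaled):
            adv[i] = a
    return adv
```

With rewards `[1.0, 0.0, 0.5, 0.0]` where the last two rollouts are noisy, the noisy rollout scoring 0.5 receives an advantage close to +1 because it is compared only against other noisy rollouts. Normalizing all four together would instead shrink it toward zero, which is the contamination the review refers to.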

However, several concerns arise:

  • Evaluation scope is limited. All experiments use only two benchmarks (τ2-Bench and VitaBench) and their noisy counterparts (AgentNoiseBench). AgentNoiseBench appears to come from the same research group (it shares multiple co-authors), which raises questions about independent validation. The improvements, while consistent, are sometimes modest in absolute terms (e.g., 2.25→4.75 Avg@4 on OTA in Table 1).
  • Confounding factors in the training pipeline. The system employs GPT-4.1 for environment construction, Claude-Sonnet-4.5 as verifier, GLM-4.6 for instruction synthesis, and Qwen2.5-72B for noise injection and user simulation. This complex multi-model pipeline makes it difficult to isolate the contribution of the noise injection mechanism versus engineering choices in the pipeline.
  • Statistical reporting. While experiments are repeated 4 times, no confidence intervals or significance tests are reported. The Avg@4 and Pass@4 metrics, while informative, don't fully characterize the variance of results.
  • Threshold sensitivity. The scheduling threshold Δ=0.05 is presented without sensitivity analysis. How this threshold affects training stability and final performance is unclear.
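The statistical-reporting concern can be sketched in a few lines. The review does not define Avg@4 or Pass@4, so the definitions below are assumptions: Avg@k as the per-task success rate over k runs averaged across tasks, and Pass@k as the fraction of tasks solved in at least one of k runs. `std_error` shows one simple uncertainty estimate that could accompany such numbers; it measures spread across tasks only.

```python
from math import sqrt
from statistics import mean, stdev


def avg_at_k(results):
    """Per-task success rate over k runs, averaged across tasks."""
    return mean(mean(runs) for runs in results)


def pass_at_k(results):
    """Fraction of tasks solved in at least one of the k runs."""
    return mean(1.0 if any(runs) else 0.0 for runs in results)


def std_error(results):
    """Standard error of the per-task success rate across tasks."""
    per_task = [mean(runs) for runs in results]
    return stdev(per_task) / sqrt(len(per_task))
```

Here `results` is a list with one entry per task, each holding k outcomes coded as 0 or 1. Two systems with similar Avg@4 can differ substantially in `std_error`, which is why the missing intervals matter for small gains.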

3. Potential Impact

    The paper addresses a genuinely important problem: real-world LLM agents face noisy, imperfect environments that current training paradigms ignore. The practical relevance is high, as tool-calling failures, ambiguous user queries, and incomplete API responses are ubiquitous in production systems.

    The finding that noise-aware training also improves performance on clean benchmarks (Table 2) is particularly noteworthy and practically valuable—it suggests this is not merely a robustness technique but a general training improvement, akin to data augmentation benefits observed in other domains.

    The framework is relatively general and could be applied to various agentic training pipelines. The noise taxonomy (user-side and tool-side) provides a useful conceptual framework for thinking about deployment-time failures.

    However, the impact is somewhat constrained by: (a) reliance on a specific environment scaling pipeline [34], limiting immediate reproducibility; (b) the high computational cost (32-64 H800 GPUs for 3-5 days); and (c) the fact that the core idea—training with noise improves robustness—is well-established in classical RL (domain randomization, adversarial training), and the paper's novelty lies primarily in applying and adapting this principle to LLM agent training.

4. Timeliness & Relevance

    This paper is highly timely. LLM agents are being deployed rapidly in production (customer service, coding assistants, search), and robustness under imperfect conditions is a recognized bottleneck. The identification of the train-deploy gap for agentic systems is well-motivated and resonates with practical challenges.

    The work sits at the intersection of two active research areas: agentic RL training (DeepSeek-R1, GRPO/DAPO/GSPO) and agent robustness evaluation. By connecting these, it fills a genuine gap where most prior work focused on either training methods in clean settings or evaluating robustness without proposing training solutions.

5. Strengths & Limitations

    Strengths:

  • Clear problem formulation with a well-motivated gap between training and deployment
  • Comprehensive noise taxonomy covering both user and tool sides
  • The hybrid training with group-wise normalization is a thoughtful design that addresses a real training stability challenge
  • The finding that noisy training improves clean performance is compelling and has practical implications
  • The interaction pattern analysis (Table 4) and case study provide useful qualitative insights into learned behaviors—agents make fewer unnecessary tool calls and produce more informative responses

    Limitations:

  • The conceptual novelty is incremental—domain randomization and curriculum learning are well-established techniques being applied to a new domain
  • Narrow evaluation on two benchmarks from closely related work
  • The noise injection itself relies on a 72B parameter model (Qwen2.5-72B), which introduces its own biases about what constitutes "realistic" noise
  • No comparison with simpler robustness approaches (e.g., prompt-based augmentation, rejection sampling with noisy inputs)
  • The paper does not evaluate on truly open-ended or out-of-distribution real-world environments
  • Reproducibility concerns due to the complex multi-model pipeline and proprietary API dependencies (GPT-4.1, Claude-Sonnet-4.5)

Additional Observations

    The ablation study (Table 3) is conducted on only one domain (Delivery), limiting generalizability of component-level conclusions. The training dynamics analysis (Figure 2) is informative but only shown for one setting. The paper would benefit from analysis of failure modes—when does noise-aware training fail or provide diminishing returns?

    The broader framing as a "fundamental gap" is somewhat overstated given that this is essentially curriculum-based domain randomization applied to LLM agent training, but the execution is solid and the results are convincing within their scope.

    Rating: 5.8/10
    Significance 6 · Rigor 5.5 · Novelty 4.5 · Clarity 7

    Generated May 27, 2026

    Comparison History (17)

    vs. Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher impact due to broader applicability and timeliness: robustness to user/tool noise is a pervasive real-world deployment bottleneck for LLM agents, and a training framework that improves both noisy-setting performance and even ideal benchmarks can influence many agentic systems. Its approach is more directly actionable for improving deployed agents across domains (customer support, automation, tool APIs) and aligns with current focus on reliability. Paper 1 is novel and methodologically solid for evaluation/benchmarking feasibility awareness, but its primary impact is narrower (diagnostics/metrics) compared to a general training paradigm.

    vs. Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values
    gpt-5.2 · 5/28/2026

    Paper 2 is more novel and broadly impactful: it introduces a principled, span-level input-induced uncertainty decomposition using Shapley values with exact additivity, enabling actionable clarification guidance. This directly targets safety/trust in high-stakes LLM use (timely, high real-world relevance) and is applicable across domains (clinical, QA, dialogue, decision support). The methodology is grounded in information theory and cooperative game theory and is evaluated on multiple benchmarks plus a high-stakes setting. Paper 1 is practically useful for agent robustness, but the idea of training with noise/curricula is more incremental and less generalizable across fields.

    vs. SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models
    gemini-3.1 · 5/28/2026

    Paper 1 addresses a critical bottleneck in deploying LLMs in healthcare: governance, safety, and auditable reasoning. Its rigorous methodology, involving a traceable clinician-audited pipeline and direct comparison with medical residents, sets a strong precedent for safe clinical AI. While Paper 2 offers a valuable approach to general agent robustness, Paper 1 has higher potential for profound real-world impact by providing a viable pathway for integrating LLMs into high-stakes medical environments.

    vs. BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting
    claude-opus-4.6 · 5/27/2026

    Paper 2 addresses a fundamental and broadly applicable challenge—improving LLM agent robustness under real-world noisy conditions—which impacts the rapidly growing field of LLM-based agents across numerous domains. The finding that noise-augmented training also improves performance on clean benchmarks suggests deep insights about generalization. Paper 1, while methodologically solid for battery degradation forecasting, addresses a narrower application domain. Paper 2's framework (NoisyAgent) has broader cross-field impact potential given the ubiquity of LLM agent deployment, greater timeliness given the current explosion of agentic AI research, and more generalizable contributions.

    vs. LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
    gpt-5.2 · 5/27/2026

    Paper 1 likely has higher scientific impact because it proposes a general training framework (NoisyAgent) that addresses a core deployment gap for LLM agents—robustness to stochastic user/tool failures—applicable across many real-world agent settings (tool use, assistants, automation). It is method-oriented and potentially extensible as a standard training paradigm, affecting multiple domains beyond evaluation. Paper 2 is timely and useful as a benchmark for education-focused multimodal reasoning, but its impact is more field-specific and primarily evaluative rather than introducing broadly reusable methods.

    vs. MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a fundamental and broadly applicable problem—bridging the gap between idealized training and noisy real-world deployment for LLM agents. Its framework (NoisyAgent) is domain-agnostic and applicable across diverse agent tasks, giving it broader impact potential. The finding that noise-augmented training also improves performance on clean benchmarks suggests a generalizable principle. Paper 2, while valuable for clinical AI, targets a narrower domain (medical guideline reasoning) with a more incremental contribution (structured data augmentation from CPGs). Paper 1's timeliness is also higher given the rapid proliferation of LLM agents.

    vs. Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems
    gpt-5.2 · 5/27/2026

    Paper 1 has higher likely impact due to timeliness (LLM agents), broad applicability (robustness to user/tool noise affects many deployed agent systems), and strong real-world relevance. Introducing structured noise curricula into agent training is a generally transferable idea across domains and tool-augmented settings, and the reported gains both under noise and on standard benchmarks suggest practical uptake. Paper 2 offers a thoughtful conceptual decomposition and analysis of PPO failure modes, but its demonstrated scope is narrower (specific cumulative-damage settings with calibrated career-style environments) and may influence a smaller slice of RL practice.

    vs. Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a fundamental and broadly applicable problem—bridging the gap between idealized training environments and noisy real-world deployment for LLM agents. Its framework (NoisyAgent) is generalizable across many agent applications, introduces a systematic noise injection methodology with curriculum-style training, and demonstrates improvements even on clean benchmarks, suggesting deep generalization benefits. Paper 2, while novel in its Gumbel-noise-based counterfactual generation for educational writing, targets a narrower application domain with more limited cross-field impact. Paper 1's contributions are more timely given the rapid deployment of LLM agents in real-world settings.

    vs. Hylos: Operability Contracts for Model-Native Spatial Intelligence
    gemini-3.1 · 5/27/2026

    While Paper 1 offers a practical and experimentally validated approach to improving LLM agent robustness via noise injection, Paper 2 introduces a paradigm-shifting systems architecture for spatial AI. By conceptualizing 3D edits as 'SpatialTransactions' governed by operability contracts, Paper 2 addresses a critical bottleneck bridging generative AI with robotics, CAD, and manufacturing. This fundamental reframing of spatial intelligence from visual plausibility to functional operability offers a higher ceiling for long-term, cross-disciplinary scientific impact.

    vs. AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
    gemini-3.1 · 5/27/2026

    Paper 2 addresses a critical bottleneck in the widespread deployment of LLM agents: robustness to real-world noise and stochasticity. While Paper 1 provides a valuable benchmark for the specific domain of audio-video generation, Paper 2's methodology for enhancing agent resilience to user and tool noise has broader applicability across numerous fields. Improving generalization and decision-making in imperfect environments bridges a fundamental gap between idealized training and practical AI deployment, offering higher potential for widespread cross-disciplinary impact.

    vs. Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World
    gemini-3.1 · 5/27/2026

    Paper 2 introduces a highly ambitious benchmark that defines a new frontier for LLM agents: always-on, cross-device personal assistance with long-horizon context. By expanding the evaluation scope to include proactive assistance and rich contextual noise, it is likely to drive future research directions and serve as a standard for next-generation agents, offering broader field impact than the robust training methodology proposed in Paper 1.

    vs. Associations between echocardiographic traits and AI-ECG predictions of heart failure
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a fundamental and broadly impactful problem in AI agent deployment—robustness to real-world noise—proposing a novel training framework (NoisyAgent) with progressive noise injection. This has wide applicability across the rapidly growing LLM agent ecosystem. Paper 2, while clinically valuable, is primarily a correlational/validation study of an existing AI-ECG model against echocardiographic measures, offering incremental interpretability insights rather than a new methodology. Paper 1's novelty, broader applicability across AI/ML fields, and timeliness in the booming agent space give it higher potential impact.

    vs. MAPLE: Multi-State Aggregated Policy Evaluation for AlphaZero in Imperfect-Information Games
    gpt-5.2 · 5/27/2026

    Paper 1 targets a broadly important, timely problem: robustness of LLM-based agents under realistic interaction noise (user ambiguity, tool failures). The approach is conceptually general and likely transferable across many agentic systems and application domains, with clear real-world deployment relevance. If validated rigorously, training with progressively increased stochastic perturbations could influence standard agent training/evaluation practices beyond a single task family. Paper 2 is methodologically solid and impactful within imperfect-information game AI, but its scope and cross-domain applicability are narrower than robustness training for LLM agents.

    vs. Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks
    gpt-5.2 · 5/27/2026

    Paper 2 has higher potential impact because it identifies a pervasive, systemic bias in widely used production benchmarking setups and provides both a theoretical (M/G/1 queuing + GIL effects) and practical mitigation (multi-process framework) with a standardized composite metric (NTPOT). This directly affects reproducibility and correctness of LLM performance claims across industry and academia, with immediate real-world applicability to SLO-driven deployments and broad relevance to systems/ML evaluation. Paper 1 is valuable but is a more incremental extension of robustness training via noise injection, with narrower cross-field reach.

    vs. The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context
    claude-opus-4.6 · 5/27/2026

    Paper 1 identifies a fundamental, previously unnamed problem ('attribution blind spot') in RAG systems critical for AI safety and trustworthiness. It introduces a novel framework (CRM) grounded in cognitive science, with rigorous methodology across 9 model variants and 3 families. The problem it addresses—verifying whether LLM outputs are actually grounded in retrieved evidence—is essential for high-stakes AI deployment and has broad implications for interpretability, factuality, and AI governance. Paper 2 addresses a practical but more incremental contribution (adding noise to agent training), which is a well-explored paradigm in ML robustness.

    vs. Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning
    gemini-3.1 · 5/27/2026

    Paper 2 addresses a fundamental flaw in how the AI community evaluates multi-hop reasoning, identifying 'composition collapse' and proposing a novel diagnostic protocol. While Paper 1 offers a practical training framework for agent robustness, Paper 2's methodological insights are likely to shift evaluation paradigms across the broader field of LLM reasoning, leading to a deeper and wider scientific impact.

    vs. Solving Combinatorial Counting Problems with Weighted First-Order Model Counting
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a critical and timely problem—bridging the gap between idealized training and real-world deployment of LLM-based agents—which affects the rapidly growing field of AI agents. Its NoisyAgent framework is broadly applicable across diverse agent tasks and demonstrates improved robustness and generalization. Paper 2 presents a niche but well-crafted contribution to combinatorial counting via WFOMC, but its impact is narrower, targeting a specialized community. Paper 1's relevance to the dominant LLM-agent paradigm, practical applicability, and potential to influence training methodologies across the field give it higher estimated impact.