Evaluating Agentic Configuration Repair for Computer Networks

Rufat Asadli, Benjamin Hoffman, Ioannis Protogeros, Laurent Vanbever

#2768 of 3355 · Artificial Intelligence
Tournament Score: 1311±46
Win Rate: 37% (7 wins, 12 losses, 19 matches)
Rating: 5.5/10

Abstract

Misconfigurations in computer networks remain a major source of critical Internet outages. Research is turning to Large Language Models (LLMs) to automate the complex, error-prone task of network configuration. However, even state-of-the-art models fail to resolve misconfigurations in large-scale, complex scenarios and often introduce new errors. In this work, we benchmark open- and closed-source LLMs augmented with formal network verification and context retrieval tools. We demonstrate that agentic architectures outperform base LLMs in repair efficacy (by 12% on average) and safety (by 17% on average), enabled by the ability to dynamically manage context and iteratively validate configuration repairs.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper proposes and evaluates a minimal agentic framework for automated network configuration repair, built on a ReAct-style agent equipped with three custom tools: dynamic context retrieval (selective inspection of router configurations), iterative search-and-replace editing with rollback, and formal verification feedback via Batfish. The work benchmarks four LLMs (GPT-5 Mini, GLM-4.7, Qwen3.5-9B, Gemma4 E4B) on the CORNETTO benchmark's 231 misconfiguration scenarios, demonstrating that the agentic approach improves fix scores by ~12% and reduces regression rates by ~17% on average compared to monolithic (single-turn) prompting.

The core insight is straightforward but valuable: giving LLMs the ability to iteratively retrieve context, apply patches, verify results, and roll back mistakes directly addresses the three key challenges of network configuration repair: noise in large configurations, single-shot failure modes, and difficulty predicting network-wide forwarding consequences.
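The retrieve, edit, verify, and roll-back cycle described above can be illustrated with a minimal sketch. This is not the paper's implementation: `RepairSession`, `toy_verify`, and the configuration snippets are hypothetical names introduced here, and the toy verifier is only a stand-in for a formal tool such as Batfish.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Configs = Dict[str, str]  # router name -> configuration text


@dataclass
class RepairSession:
    """Hypothetical helper: router configs plus a snapshot stack for rollback."""
    configs: Configs
    verify: Callable[[Configs], List[str]]  # returns the violated properties
    history: List[Configs] = field(default_factory=list)

    def retrieve(self, router: str) -> str:
        """Context retrieval: inspect one router instead of the whole network."""
        return self.configs[router]

    def edit(self, router: str, search: str, replace: str) -> bool:
        """Search-and-replace edit; snapshot first so the edit can be undone."""
        if search not in self.configs[router]:
            return False
        self.history.append(dict(self.configs))
        self.configs[router] = self.configs[router].replace(search, replace, 1)
        return True

    def rollback(self) -> bool:
        """Restore the most recent snapshot, if any."""
        if not self.history:
            return False
        self.configs = self.history.pop()
        return True

    def try_edit(self, router: str, search: str, replace: str) -> bool:
        """Keep an edit only if it does not increase the number of violations."""
        before = len(self.verify(self.configs))
        if not self.edit(router, search, replace):
            return False
        if len(self.verify(self.configs)) > before:
            self.rollback()
            return False
        return True


def toy_verify(configs: Configs) -> List[str]:
    # Stand-in for a real verifier: flag routers whose interface is shut down.
    return [r for r, text in configs.items() if "\n shutdown" in text]


session = RepairSession(
    configs={
        "r1": "interface eth0\n shutdown",
        "r2": "interface eth0\n no shutdown",
    },
    verify=toy_verify,
)
accepted = session.try_edit("r1", " shutdown", " no shutdown")  # repairs r1
rejected = session.try_edit("r2", " no shutdown", " shutdown")  # breaks r2, undone
```

In this sketch the helpful edit is kept and the harmful one is reverted, which is the mechanism the review credits for lower regression rates. A real agent would let the LLM choose which tool to call at each step.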

Methodological Rigor

The experimental methodology is reasonably sound for a workshop paper but has notable limitations:

Strengths in evaluation design: The paper uses a well-defined benchmark (CORNETTO) with 231 scenarios across real-world topologies from Topology Zoo, spanning 27 fault types. The evaluation metrics (fix score and regression rate) are clearly defined and meaningful — capturing both efficacy and safety. The ablation study isolating verifier feedback and context retrieval provides useful mechanistic understanding.

Concerns: The paper evaluates only four models with a single run configuration (temperature 0.7), without reporting variance or confidence intervals on the main results (Fig. 2). The 95% CI bands in Fig. 3 are only shown for the step-budget analysis. The ablation in Table 1 is limited to two models, and the interactions between design choices are acknowledged but not systematically explored. The comparison against monolithic baselines inherits CORNETTO's random context sampling strategy, which means the monolithic baseline may not represent the strongest possible single-turn approach. Additionally, absolute fix scores remain modest — even the best agentic configuration achieves only ~49% — suggesting substantial room for improvement.
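One way to add the missing uncertainty estimates is a percentile bootstrap over per-scenario scores. The sketch below assumes each scenario yields a score in [0, 1]. The `placeholder_scores` list is made-up illustrative data, not results from the paper, and `bootstrap_ci` is a hypothetical helper name.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample scenarios with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lower, upper


# Placeholder per-scenario fix scores (illustrative only, not from the paper).
placeholder_scores = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
point, lower, upper = bootstrap_ci(placeholder_scores)
print(f"mean={point:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

Repeating each scenario over several sampled runs and resampling at the scenario level would also capture run-to-run variance at temperature 0.7.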

The paper honestly acknowledges an interesting negative finding: agents achieve lower diagnosis scores than monolithic approaches due to higher false-positive rates (lower precision despite comparable recall). Reporting this result adds credibility to the evaluation.
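A small worked example shows how extra false positives lower a diagnosis score even when recall is unchanged. It assumes the score behaves like F1 over sets of suspected fault locations, which the review does not confirm. The labels such as `r1:bgp` are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Set-based precision, recall, and F1 for a diagnosis."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


actual_faults = {"r1:bgp", "r4:acl"}  # hypothetical ground truth

# A focused diagnosis: one correct guess, one wrong guess.
focused = precision_recall_f1({"r1:bgp", "r7:ospf"}, actual_faults)

# A broader diagnosis: the same correct guess plus three false positives.
broad = precision_recall_f1(
    {"r1:bgp", "r2:ospf", "r3:acl", "r5:bgp"}, actual_faults
)

print("focused:", focused)
print("broad:  ", broad)
```

Both diagnoses find the same true fault, so recall matches, but the broader one pays for its additional guesses in precision and therefore in F1.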

Potential Impact

Near-term practical relevance: Network misconfiguration is genuinely a major cause of Internet outages, as evidenced by well-documented incidents at Meta and Cloudflare. The paper connects to real industry efforts (ByteDance's NetAssistant, Alibaba's BiAn, Meta's Confucius), positioning the work in an active deployment trajectory. The cost analysis (Appendix A.3) showing ~5x cost increase for agentic approaches is practically relevant for deployment decisions.

Research impact: The paper provides a useful data point for the broader question of how agentic scaffolding improves LLM performance on domain-specific tasks. The finding that open-source models benefit disproportionately (up to 7x fix score improvement) is particularly interesting — suggesting that agentic architectures can partially compensate for weaker base capabilities. The observation that verification feedback makes GPT-5 Mini more conservative (lower fix score but also lower regression) while helping Qwen3.5-9B on both dimensions reveals nuanced model-dependent dynamics worth further investigation.

Broader influence: The framework of combining formal verification tools with LLM agents is applicable beyond networking — to any domain where automated verification can provide reliable feedback signals (hardware design, smart contracts, etc.). However, the specific tools and prompts are highly domain-specific.

Timeliness & Relevance

The paper is well-timed, sitting at the intersection of two active research areas: LLM-based agentic systems and automated network management. The workshop venue (Agents in the Wild at ICML 2026) is appropriate. The paper directly builds on CORNETTO (2026) and references concurrent work (NETARENA, NIKA), establishing it within a rapidly developing ecosystem. The emphasis on safety metrics (regression rate) addresses a genuine deployment barrier.

Strengths & Limitations

Key Strengths:

1. Clear problem formulation with well-motivated design choices tied to specific failure modes of monolithic approaches

2. Systematic ablation revealing non-obvious trade-offs (e.g., prefilled vs. dynamic context retrieval depends on model capability)

3. Practical considerations including cost analysis and step-budget saturation analysis

4. The tool trajectory analysis (Fig. 4, Fig. 7) provides insight into emergent agent behavior

5. Honest reporting of negative results (diagnosis quality degradation)

Notable Limitations:

1. Incremental novelty: The agentic architecture is a relatively straightforward application of ReAct with domain-specific tools. The individual components (context retrieval, iterative editing, verification feedback) are not themselves novel

2. Modest absolute performance: The best configuration achieves ~49% fix score, meaning more than half of scenarios remain unresolved

3. Limited model coverage and statistical rigor: Evaluating four models without repeated trials limits generalizability claims

4. No comparison to other agentic baselines: The paper compares only against monolithic prompting; comparison with alternative agent architectures (plan-and-execute, tree-of-thought, etc.) would strengthen the contribution

5. Scalability questions: The paper doesn't deeply analyze how performance varies with network size within the benchmark, though networks range up to 754 nodes

6. Reproducibility: While prompts are provided in full (a positive), the reliance on proprietary models limits full reproducibility

Overall Assessment

This is a solid workshop paper that applies agentic LLM architectures to an important practical problem and provides useful empirical evidence that iterative tool use improves both efficacy and safety of network configuration repair. The contribution is primarily empirical and engineering-oriented rather than conceptually novel, but the domain-specific insights (particularly around the model-dependent effects of verification feedback and context management) are valuable for the networking and AI agents communities. The paper would benefit from deeper statistical analysis, broader model coverage, and comparison against alternative agent designs.

Rating: 5.5/10
Significance 5.5 · Rigor 5 · Novelty 4.5 · Clarity 7.5

Generated Jun 5, 2026

Comparison History (19)

vs. Front-to-Attractors: Modifying the Front-to-Front Heuristic in Bidirectional Search
claude-opus-4.6 · 6/8/2026

Paper 2 addresses the timely and high-impact intersection of LLMs with network reliability, a critical real-world problem affecting Internet infrastructure. It combines formal verification with agentic AI architectures, contributing to the rapidly growing field of LLM-based automation. Its broader relevance across AI, networking, and software engineering communities, plus immediate practical applicability, gives it higher potential impact. Paper 1, while technically sound, addresses a more niche topic in bidirectional heuristic search with incremental improvements that appeal to a narrower audience in classical AI planning/search.

vs. Accounting for Context: Shaping Moral Credences for Value Alignment
gpt-5.2 · 6/8/2026

Paper 1 likely has higher impact due to clear methodological grounding (benchmarking multiple LLMs with formal verification and retrieval), measurable improvements, and immediate real-world applicability to preventing network outages—a high-cost, high-stakes domain. It is timely given rapid adoption of LLM agents for operational automation and could influence both networking and AI safety/verification practices. Paper 2 offers a novel theoretical critique in value alignment and moral uncertainty, but its impact is more specialized and may translate more slowly into deployable systems or empirical research.

vs. StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents
gemini-3.1 · 6/8/2026

Paper 1 introduces a highly novel, biologically/network-inspired methodology (StainFlow) to solve a fundamental problem in RL (credit assignment in long-horizon tasks). While Paper 2 addresses a critical real-world problem, it is primarily a benchmarking and evaluation study of existing LLM architectures. Paper 1's methodological innovation offers broader potential impact across the rapidly growing field of autonomous AI agents.

vs. How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment
claude-opus-4.6 · 6/6/2026

Paper 1 has higher potential scientific impact due to its novelty, timeliness, and breadth of relevance. It analyzes a unique, naturally-occurring dataset from a controversial real-world deployment of covert LLM agents in deliberative forums—a rare empirical opportunity unlikely to be replicated ethically. Its findings about persuasive AI tactics, synthetic identity performance, and cognitive bias exploitation have broad implications for AI governance, platform integrity, democratic deliberation, and policy. Paper 2 addresses an important but narrower engineering problem (network configuration repair) with incremental improvements, and its impact is more domain-specific.

vs. PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation
gemini-3.1 · 6/6/2026

Paper 1 presents a novel two-stage training methodology and framework for generating human-aligned synthetic users, offering broad applications across HCI and software development. Paper 2, while addressing a critical infrastructure problem, is primarily a benchmarking study of existing agentic architectures. The methodological innovation and wider potential applicability of Paper 1 give it a higher potential scientific impact.

vs. Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection
gemini-3.1 · 6/6/2026

Paper 2 addresses a fundamental challenge in causal inference and data analysis, offering broad applicability across any scientific discipline that utilizes observational data. By bridging causal discovery with practical feature selection, it has the potential to fundamentally alter machine learning practices. While Paper 1 provides a valuable and highly practical application of LLM agents in networking, its impact is relatively confined to network management, making Paper 2's foundational contribution more widely impactful.

vs. Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation
gpt-5.2 · 6/6/2026

Paper 2 has higher likely scientific impact due to strong real-world relevance (preventing major network outages), clear practical applicability, and methodological rigor via benchmarking across models plus formal verification and tool-augmented agentic workflows with quantified gains in efficacy and safety. Its results are timely for LLMs-in-systems and can influence networking, reliability engineering, and agentic AI evaluation. Paper 1 is novel for social simulation interpretability (private state vs public speech), but impact may be narrower, more conceptual, and harder to validate against real-world ground truth.

vs. RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit
claude-opus-4.6 · 6/5/2026

Paper 2 addresses a critical real-world problem (network misconfigurations causing Internet outages) with a practical agentic approach combining LLMs with formal verification tools. It demonstrates concrete, measurable improvements in repair efficacy and safety, has immediate applicability to network operations, and contributes to the growing field of LLM-based agentic systems. Paper 1, while methodologically thorough, is more niche—focused on community-conditioned LLM adaptation from Reddit—with narrower applicability and less immediate real-world impact. Paper 2's intersection of formal methods with LLM agents has broader relevance across AI and systems research.

vs. Framing, Judging, Steering: An Assessable Competency Model for Teaching Students to Reason With Generative AI
gpt-5.2 · 6/5/2026

Paper 2 is likely to have higher scientific impact due to direct applicability to a high-stakes, widely relevant infrastructure problem (network outages) and clear, measurable gains from an agentic + formal verification approach. The combination of LLMs with verification/context tools is methodologically grounded and aligns with current trends in reliable AI systems, potentially influencing both networking and AI-safety/tool-use research. Paper 1 is novel for education/assessment of AI reasoning, but its evidence is currently mostly simulated and pending human validation, which may limit near-term impact.

vs. DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance
gemini-3.1 · 6/5/2026

Paper 2 addresses a fundamental limitation of autoregressive decoding (early commitment) in LLM tool planning by introducing a novel diffusion-based approach. This methodological innovation has broader applicability across various AI domains and tasks compared to Paper 1, which primarily focuses on benchmarking existing LLM agent architectures for a specific applied problem (network configuration repair).

vs. Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement
gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact due to a more novel, generally applicable agentic paradigm (global blueprint generation/refinement) and exceptionally strong, state-of-the-art results across multiple challenging benchmarks (MiniF2F, PutnamBench, IMO/Putnam/USAMO) with large cost advantages. Its applications span automated reasoning, math, verification, and software/hardware assurance, giving broader cross-field relevance. Paper 1 is timely and practically important for networking reliability, but the contribution appears more incremental (benchmarking/tool-augmentation with modest gains) and narrower in scope.

vs. Where does Absolute Position come from in decoder-only Transformers?
gemini-3.1 · 6/5/2026

Paper 2 provides fundamental insights into Transformer mechanics, explaining how absolute positional information emerges in models using only relative encodings. This deep understanding of attention sinks and architecture directly impacts the broader AI community's approach to designing and interpreting foundation models. While Paper 1 presents a valuable and practical application of LLMs for network configuration, its scientific impact is narrower and more localized to the systems and networking fields compared to the foundational nature of Paper 2.

vs. VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark
gemini-3.1 · 6/5/2026

Paper 1 addresses a critical real-world problem—preventing major Internet outages caused by network misconfigurations. By combining LLM agents with formal verification, it offers a practical, high-impact solution to a high-stakes engineering challenge. In contrast, Paper 2 introduces a useful but more niche evaluation benchmark for multimodal LLMs in math problem-solving. The direct application to critical infrastructure and demonstrated improvements in safety and efficacy give Paper 1 a higher potential for immediate, broad scientific and real-world impact.

vs. PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models
gemini-3.1 · 6/5/2026

Paper 2 addresses a critical challenge in end-to-end autonomous driving—interpreting and controlling latent world models for safe trajectory planning. By introducing a style-conditioned semantic cost map, it improves both safety (42% collision rate reduction) and interpretability in a high-stakes, rapidly advancing field. While Paper 1 offers a practical application of LLM agents, Paper 2's methodological innovation in world models and its direct implications for autonomous vehicle safety suggest a broader and more significant scientific and real-world impact.

vs. GITCO: Gated Inference-Time Context Optimization in TSFMs
gemini-3.1 · 6/5/2026

Paper 1 introduces a novel inference-time optimization framework and a new theoretical property (context sensitivity profiles) for time series foundation models, offering fundamental methodological contributions. Paper 2, while practically valuable, primarily benchmarks existing agentic LLM techniques for a specific application, offering less foundational scientific innovation.

vs. AdaMEM: Test-Time Adaptive Memory for Language Agents
gemini-3.1 · 6/5/2026

Paper 2 introduces a novel, general-purpose framework for test-time adaptive memory in language agents, addressing a fundamental challenge in AI with broad applicability across multiple domains. In contrast, Paper 1 primarily focuses on benchmarking existing LLM approaches within the specific, narrower domain of network configuration repair. Paper 2's methodological innovation and potential for widespread adoption across various agentic systems give it a higher estimated scientific impact.

vs. Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models
claude-opus-4.6 · 6/5/2026

Paper 1 addresses a fundamental efficiency bottleneck in Large Reasoning Models through a novel observation about decision-critical tokens and proposes a general-purpose KV cache optimization method (DynTS). This has broad applicability across all LRM deployments, touching inference efficiency—a critical bottleneck for the entire field. Paper 2, while practically useful, is more application-specific (network configuration repair) and primarily benchmarks existing approaches (agentic LLMs with verification tools) rather than introducing a fundamentally new method. Paper 1's insights about reasoning token importance and the resulting efficiency gains have wider cross-domain impact.

vs. TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management
claude-opus-4.6 · 6/5/2026

Paper 2 addresses the critical and timely problem of automated network configuration repair using LLM agents augmented with formal verification, demonstrating clear practical impact on network reliability. It benchmarks both open and closed-source LLMs with agentic architectures, showing meaningful improvements in repair efficacy and safety. The combination of LLMs with formal verification tools is a compelling and broadly applicable paradigm. Paper 1, while technically interesting, addresses a narrower infrastructure concern (context management for LLMs) with a relatively small evaluation (21 sessions) and moderate recall numbers, limiting its broader impact.

vs. When Should We Protect AI? A Precautionary Framework for Consciousness Uncertainty
claude-opus-4.6 · 6/5/2026

Paper 1 addresses a fundamentally novel and increasingly urgent question—how to translate AI consciousness assessments into actionable protective obligations—offering a comprehensive precautionary framework with broad interdisciplinary relevance spanning philosophy, AI ethics, policy, and consciousness science. As AI systems grow more sophisticated, this framework could shape regulation and industry practice globally. Paper 2, while practically useful, represents an incremental engineering contribution (benchmarking LLMs for network configuration repair) within a narrower domain, with more limited potential to influence fields beyond network operations.