Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies

Wei Zheng, Yang Yan, Yiyang Shao, Jinyang Li, Zeze Chang, Yukuang Jia, Qiming Mao, Chihyung Wang

May 28, 2026

arXiv:2605.29270v1

cs.AI (primary)

#1366 of 2821 · Artificial Intelligence

Tournament Score: 1413 ± 49 (scale: 1050 to 1800)

Win Rate: 50%

Rating: 5.8 / 10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Abstract

The era of the Internet of Agents (IoA) is taking shape: LLM agents are expected to fulfill user goals by orchestrating fast-growing populations of Model Context Protocol (MCP) servers, Agent-to-Agent (A2A) endpoints, reusable skills, and other LLM-callable services. Yet LLMs face a structural mismatch with this regime: effective context is a scarce resource that does not scale with the number of services. Concatenating thousands of service descriptions into a prompt overflows the context window, and even when the window is large enough, models systematically under-attend to information in the middle of long inputs, the well-documented Lost-in-the-Middle phenomenon. This is fundamentally a question of context management for service discovery. To address this, we propose an LLM-native progressive-disclosure scheme and its concrete instantiation, A2X (Agent-to-Anything service discovery): an LLM-driven pipeline that automatically organizes the registered services into a hierarchical taxonomy and walks it layer by layer at query time, so that every LLM call sees only a small candidate set highly relevant to the user query. This decouples effective-context scarcity from registry size and significantly reduces token consumption while improving retrieval accuracy. Compared to full-context dumping, A2X achieves a 6.2-point Hit Rate gain at one-ninth the prompt-token cost; compared to the state-of-the-art open-source embedding-based baseline, A2X improves Hit Rate by more than 20 points.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies"

1. Core Contribution

The paper introduces A2X (Agent-to-Anything), a system that addresses the service discovery problem for LLM agents by automatically constructing a hierarchical taxonomy of services and navigating it via progressive disclosure at query time. The key insight is reframing service discovery as a context management problem: rather than dumping thousands of service descriptions into a single LLM prompt (which causes token bloat and Lost-in-the-Middle degradation) or relying on embedding-based retrieval (which sacrifices semantic understanding), A2X builds a tree structure offline using BFS-based recursive splitting, then traverses it at query time through a sequence of short, focused LLM calls. Each call sees only ~8-15 candidates rather than the full registry.
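The query-time walk described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Node` class, the `search` function, and the `keyword_select` stand-in (which replaces the LLM call with simple word overlap) are all hypothetical names introduced here for illustration. The point it shows is that each selection call sees only one node's children, never the full registry.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical taxonomy node; a node without children is treated as a service."""
    name: str
    description: str = ""
    children: List["Node"] = field(default_factory=list)

    @property
    def is_service(self) -> bool:
        return not self.children


# Assumed selector signature: (query, candidates) -> chosen subset.
# In an A2X-style system this would be a short LLM call.
Selector = Callable[[str, List[Node]], List[Node]]


def search(root: Node, query: str, select: Selector) -> List[Node]:
    """Walk the tree layer by layer; every select() call sees one node's children only."""
    hits: List[Node] = []
    frontier = [root]
    while frontier:
        next_frontier: List[Node] = []
        for node in frontier:
            for child in select(query, node.children):
                if child.is_service:
                    hits.append(child)
                else:
                    next_frontier.append(child)
        frontier = next_frontier
    return hits


def keyword_select(query: str, candidates: List[Node]) -> List[Node]:
    """Stand-in for the LLM: keep candidates whose description shares a word with the query."""
    words = set(query.lower().split())
    return [c for c in candidates if words & set(c.description.lower().split())]


# Toy registry, invented for this example.
root = Node("root", children=[
    Node("weather", "weather forecast climate", [
        Node("get_forecast", "daily weather forecast"),
        Node("get_climate", "historical climate records"),
    ]),
    Node("finance", "stock currency finance", [
        Node("get_quote", "stock quote"),
    ]),
])

result = search(root, "weather forecast for tomorrow", keyword_select)
print([n.name for n in result])  # ['get_forecast']
```

In this sketch the selector is called twice, once with two candidates at the root and once with two candidates under "weather", so the prompt size per call depends on the branching factor rather than on the registry size.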

The contribution sits at the intersection of LLM reasoning, information retrieval, and agent systems. The conceptual framing — viewing the LLM's effective context as a scarce resource to be managed via hierarchical decomposition — is clean and well-motivated.

2. Methodological Rigor

Strengths in methodology:

The BFS construction algorithm (Algorithm 1) is well-specified with clear handling of edge cases: keyword-first compression for large nodes, single-axis constraints to prevent axis mixing, boundary clauses for disambiguation, and a refinement loop.

The ablation study (Study 2, Table 2) systematically isolates contributions of each build module, showing the recursive BFS structure is essential (one-shot drops to 59.7% HR).

The DeepSeek V4 robustness check (Appendix D.3) demonstrates the method is not overfit to a single model generation.

Three search modes (get_all, get_important, get_one) provide a smooth precision-recall trade-off, validating that the taxonomy is a meaningful structural object.
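As a rough illustration of the BFS-based recursive splitting mentioned in the first point, the sketch below builds a tree breadth-first with a pluggable splitter. It is an assumption-laden toy, not Algorithm 1: the names `build_taxonomy`, `prefix_split`, and `max_leaf` are hypothetical, the splitter groups dotted service names by prefix instead of calling an LLM, and the keyword compression, single-axis constraint, boundary clauses, and refinement loop are all omitted.

```python
from collections import deque
from typing import Callable, Dict, List

# Assumed splitter signature: service names -> named groups.
# In the paper this role is played by an LLM; here it is a plain function.
Splitter = Callable[[List[str]], Dict[str, List[str]]]


def build_taxonomy(services: List[str], split: Splitter, max_leaf: int = 15) -> dict:
    """Split oversized nodes breadth-first until every leaf holds at most max_leaf services."""
    root = {"name": "root", "services": list(services), "children": []}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        members = node["services"]
        if len(members) <= max_leaf:
            continue  # small enough to show in a single call
        groups = {k: v for k, v in split(members).items() if v}
        # Require real progress: at least two groups, each strictly smaller than the parent.
        if len(groups) < 2 or max(len(v) for v in groups.values()) >= len(members):
            continue  # splitter failed; keep the node as an oversized leaf
        for name, group in groups.items():
            child = {"name": name, "services": group, "children": []}
            node["children"].append(child)
            queue.append(child)
        node["services"] = []  # services now live in the children
    return root


def prefix_split(names: List[str]) -> Dict[str, List[str]]:
    """Toy splitter: group by the first dotted component that is not shared by all names."""
    parts = [n.split(".") for n in names]
    depth = 0
    while all(len(p) > depth for p in parts) and len({p[depth] for p in parts}) == 1:
        depth += 1
    groups: Dict[str, List[str]] = {}
    for name, p in zip(names, parts):
        key = p[depth] if len(p) > depth else name
        groups.setdefault(key, []).append(name)
    return groups


# Invented service names, used only to exercise the sketch.
demo_services = [
    "weather.forecast.daily",
    "weather.forecast.hourly",
    "weather.alerts.storm",
    "finance.stocks.quote",
    "finance.fx.rate",
]
tree = build_taxonomy(demo_services, prefix_split, max_leaf=2)
print([c["name"] for c in tree["children"]])  # ['weather', 'finance']
```

The progress check before expanding a node is a design choice of this sketch: it guarantees termination even if a splitter returns a group as large as its parent.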

Weaknesses in methodology:

The evaluation is conducted on only two datasets: a cleaned version of ToolRet (1,839 services from an original 44,453 — a 4.1% subset) and publicMCP (1,387 services with only 50 queries). The aggressive cleaning of ToolRet raises questions about selection bias; retaining only services with ground-truth query coverage essentially pre-filters for well-described, benchmark-friendly services. This may overstate A2X's advantage since LLM-native methods benefit disproportionately from high-quality descriptions.

The publicMCP evaluation uses only 50 queries, which is statistically fragile: each failed query shifts Hit Rate by 2 points, so a reported 100% Hit Rate could flip with a handful of failures.

The comparison against embedding baselines uses only open-source models (MiniLM, BGE-large-en, BGE-M3). No comparison against commercial embedding APIs (OpenAI, Cohere) or hybrid retrieval approaches is provided, leaving the 20+ point gap potentially overstated.

Precision is deliberately de-emphasized due to annotation incompleteness, which is acknowledged but still limits the assessment of false-positive rates.

The paper acknowledges but does not characterize the scaling behavior beyond ~2k services, which is critical for the "Internet of Agents" framing.
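To make the concern about the 50-query publicMCP set concrete, the sketch below computes a 95% Wilson score interval for a hit rate. The function name `wilson_interval` is introduced here for illustration, and the input of 50 hits out of 50 is an assumption based on the 100% figure mentioned above, not a number taken from the paper's tables.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed scenario: all 50 queries are hits.
low, high = wilson_interval(50, 50)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the lower bound comes out at roughly 0.93, so a perfect score on 50 queries is compatible with a true hit rate several points below 100%.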

3. Potential Impact

The problem is genuinely important: as MCP servers, A2A endpoints, and agent-callable services proliferate, scalable discovery becomes a bottleneck. The paper's framing of this as a context management problem is likely to influence how the community thinks about agent-service interaction.

Practical applications:

Agent orchestration platforms needing to route queries across thousands of tools

Enterprise service catalogs where internal tools need discoverable organization

The "agent DNS" analogy in the conclusion is compelling for Internet-scale deployment

Limitations on impact:

The approach requires a full taxonomy rebuild when services change significantly, and incremental updates are "implemented in prototype form but not yet benchmarked"

Build cost (~$4-8 and 3 hours for 1,839 services) is non-trivial and scales uncertainly

The LLM dependency means the system inherits model-specific failure modes and requires API access for every query

4. Timeliness & Relevance

This paper is extremely timely. MCP was released by Anthropic in late 2024, Google's A2A protocol launched in 2025, and the proliferation of agent-callable services is an active, rapidly evolving area. The paper correctly identifies that the current approach of dumping all tool descriptions into context is unsustainable, and the "Lost-in-the-Middle" problem is well-documented. The work addresses a genuine engineering bottleneck that the agent community is actively confronting (as evidenced by the LiveMCPBench citation attributing nearly half of MCP failures to retrieval).

5. Strengths & Limitations

Key Strengths:

Clean problem formulation: The context-management framing with the constraint τ(F; q, S) ≤ B independent of N is elegant and well-motivated

Fully autonomous: No human-curated ontology required; the LLM builds and navigates its own taxonomy

Cross-lingual robustness: English and Chinese results are comparable, inherited from the LLM

Token efficiency: 9× reduction vs. full-context with accuracy improvements is a strong practical result

Comprehensive ablation: Studies 1-3 isolate search, build, and paradigm-level contributions

Reproducibility: Code, datasets, and full audit trails released

Notable Weaknesses:

Scale validation gap: The "Internet of Agents" framing promises scale, but experiments cap at ~2k services. The O(log N) cost claim is qualified as data-dependent and unverified beyond current benchmarks.

Benchmark limitations: Heavily cleaned dataset and tiny query set for publicMCP weaken generalizability claims

Missing strong baselines: No comparison against learned retrievers (ColBERT, fine-tuned models), commercial embeddings, or hybrid retrieve-then-rerank pipelines

Latency not reported: Multiple sequential LLM calls (avg 8 per query) may introduce significant latency compared to single embedding lookups, which is critical for real-time agent systems

Single LLM vendor: All experiments use DeepSeek; while V4 robustness is shown, cross-vendor generalization (GPT-4, Claude, Llama) is untested
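The scale-validation and latency points can be put in rough numbers with a back-of-the-envelope calculation. The sketch below assumes a perfectly balanced tree and a single path from root to leaf, which the paper itself does not claim (it qualifies the O(log N) cost as data-dependent); `balanced_depth` is a hypothetical helper, and the branching factors 8 and 15 are taken from the candidate-set range quoted earlier in this assessment.

```python
def balanced_depth(n_services: int, branching: int) -> int:
    """Smallest depth d such that branching**d >= n_services (balanced-tree assumption)."""
    if n_services < 1 or branching < 2:
        raise ValueError("need n_services >= 1 and branching >= 2")
    depth, capacity = 0, 1
    while capacity < n_services:
        capacity *= branching
        depth += 1
    return depth


# Registry sizes from the assessment: cleaned ToolRet and the original ToolRet.
for n in (1839, 44453):
    for b in (8, 15):
        print(f"N={n}, branching={b}: depth {balanced_depth(n, b)}")
```

Under these idealized assumptions, 1,839 services need 3 to 4 levels and 44,453 services need 4 to 6. The reported average of 8 calls per query is higher than either single-path depth, which would be consistent with the search following more than one branch at some levels, so sequential-call latency deserves direct measurement.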

Additional Observations

The paper's positioning as a "paradigm-level" contribution (LLM-native discovery as the successor to embedding-based retrieval) is ambitious but insufficiently supported by evidence at the current evaluation scale. The argument that inference cost will continue to fall, making LLM-native discovery economically dominant, is speculative. In practice, hybrid approaches combining embedding pre-filtering with LLM reranking may prove more practical.

The usage-aware refinement direction (optimizing taxonomy depth by query frequency) is a promising extension that would strengthen the practical case considerably.

Rating: 5.8 / 10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Generated May 29, 2026

Comparison History (16)

vs. KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

gpt-5.2 · 5/29/2026

Paper 2 likely has higher scientific impact due to broader applicability and cross-field relevance: multimodal time series forecasting spans finance, healthcare, climate, operations, and science. Its agentic fusion of LLM semantic reasoning with TSFM numerical forecasting, plus a curated trajectory corpus and an RL-for-forecasting training paradigm, suggests methodological depth and a reusable framework that could influence both forecasting and agent research. Paper 1 addresses an important, timely systems problem in LLM service discovery, but its impact is more specialized to agent registries/tool retrieval compared to the wide downstream reach of forecasting advances.

vs. ReasonOps: Operator Segmentation for LLM Reasoning Traces

claude-opus-4.6 · 5/29/2026

ReasonOps provides a foundational analytical framework for understanding LLM reasoning traces, discovering universal operators across 12 models and 8 benchmarks. Its contributions—reasoning fingerprints, correctness prediction, early quality estimation—have broad applicability across the rapidly growing field of reasoning LLMs. The unsupervised, annotation-free methodology is highly reusable. While Paper 1 (A2X) solves an important engineering problem in service discovery with strong practical results, Paper 2 offers deeper scientific insights into LLM cognition with wider cross-disciplinary impact and greater potential to influence future research directions in interpretability and reasoning.

vs. AgentSchool: An LLM-Powered Multi-Agent Simulation for Education

gemini-3.1 · 5/29/2026

Paper 1 addresses a critical, widespread technical bottleneck in the rapidly growing field of LLM agents (context limits in service discovery). By providing a scalable, highly effective mechanism for agents to interface with massive numbers of tools, it offers immediate, broad applicability across AI ecosystems. While Paper 2 is highly innovative in educational simulation, Paper 1's foundational contribution to agent architecture gives it greater immediate cross-disciplinary impact and practical utility.

vs. Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

gpt-5.2 · 5/29/2026

Paper 2 targets a broadly shared, timely bottleneck in agent ecosystems: scalable service discovery under context limits and Lost-in-the-Middle. Its LLM-native recursive taxonomy construction and progressive disclosure can generalize across domains and infrastructure (MCP/A2A/skills registries), enabling real-world deployment beyond a single vertical. The reported gains versus both full-context prompting and embedding baselines suggest strong practical impact with clear methodological framing. Paper 1 is valuable but more domain-specific (medical AI orchestration) and closer to an incremental multi-agent integration pattern already explored in prior work.

vs. BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models

gemini-3.1 · 5/29/2026

Paper 1 addresses a foundational challenge in the rapidly emerging field of LLM agents (service discovery and context limits) with a novel architectural approach. Its impact on the scalability of agentic ecosystems and the Internet of Agents promises broader, more transformative scientific follow-up than Paper 2, which, while highly practical and efficient, represents an optimization of existing diffusion models for mobile deployment.

vs. Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation

gpt-5.2 · 5/29/2026

Paper 1 has higher scientific impact potential due to stronger cross-domain novelty and real-world relevance: it closes the loop between an LLM agent and a high-fidelity physics simulator to solve a hard inverse problem, demonstrating gains over established Bayesian optimization across chemistries and conditions and validating on real battery data, including degradation fitting. This targets a major bottleneck for battery R&D with clear industrial and scientific payoff and suggests a general paradigm for reasoning-based optimization in scientific computing. Paper 2 is timely and useful for agent ecosystems but is more application/engineering-focused and likely narrower scientifically.

vs. Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

gemini-3.1 · 5/29/2026

Paper 1 addresses a highly timely and critical bottleneck in the rapidly expanding LLM agent ecosystem: scalable service discovery within context window limits. By introducing an LLM-native taxonomy construction and progressive-disclosure search, it offers immediate, highly practical real-world applications, especially with the rise of Model Context Protocols. Its massive token savings and strong accuracy gains over embedding baselines suggest broad industry impact. While Paper 2 presents rigorous theoretical advancements in causal bandits, Paper 1's direct alignment with urgent generative AI scalability challenges gives it a higher potential for widespread, near-term impact.

vs. GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation

gemini-3.1 · 5/29/2026

Paper 2 addresses a critical bottleneck in the rapidly expanding field of LLM agents (tool/service discovery and context window limitations). Its proposed solution for scalable service orchestration has broad applicability across AI and software engineering, offering high potential impact. Paper 1, while methodologically sound, is constrained to a specific domain (tourist mobility modeling), limiting its broader scientific influence compared to the foundational AI system improvements in Paper 2.

vs. Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

gemini-3.1 · 5/29/2026

Paper 1 addresses a fundamental bottleneck in the rapidly expanding field of LLM agents (service discovery and context limits) by introducing a scalable, LLM-native hierarchical retrieval method. Its proposed A2X framework offers broad, real-world utility across any multi-agent or tool-use ecosystem, significantly reducing token costs while improving accuracy over standard embedding baselines. In contrast, Paper 2 provides a more narrow empirical benchmark on screen-conditioned actions, yielding specific observations about fine-tuning mismatches that are less likely to drive widespread architectural or methodological shifts.

vs. Demystifying Data Organization for Enhanced LLM Training

claude-opus-4.6 · 5/29/2026

Paper 2 addresses a fundamental and broadly applicable challenge in LLM training—data organization—with systematic guidelines and methods (STR, SAW) validated across multiple scales and stages. Its findings are relevant to virtually all LLM practitioners, backed by Microsoft research with open-source code, and touch on the universal bottleneck of training efficiency. Paper 1, while novel in proposing LLM-native service discovery taxonomies (A2X), addresses a narrower problem in the emerging but still niche Internet of Agents ecosystem. Paper 2's breadth of impact across the entire LLM training community gives it higher potential scientific impact.

vs. Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

gemini-3.1 · 5/29/2026

While Paper 1 offers a practical engineering solution for LLM service discovery, Paper 2 provides a profound theoretical contribution by formalizing probabilistic incoherence in multi-agent systems. Its mathematical rigor—utilizing compositional residuals, Rayleigh-quotient predictions, and Boyle-Dykstra projections—establishes foundational limits and deterministic repairs for agent ensembles. This rigorous methodological framework for bounding logical inconsistencies gives Paper 2 a deeper, longer-lasting scientific impact compared to the architectural pipeline proposed in Paper 1.

vs. OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation

claude-opus-4.6 · 5/29/2026

Paper 2 addresses a more fundamental and timely infrastructure challenge—service discovery in the emerging Internet of Agents ecosystem—with broad applicability across any system involving LLM-callable services (MCP, A2A, skills). It tackles the well-known Lost-in-the-Middle problem with an elegant, generalizable solution (hierarchical taxonomy + progressive disclosure) that decouples context scarcity from registry scale. This has sweeping implications for agent orchestration, a rapidly growing field. Paper 1, while strong in optimization, targets a narrower domain. Paper 2's architectural contribution is more likely to influence diverse downstream systems and become foundational infrastructure.

vs. The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

claude-opus-4.6 · 5/29/2026

Paper 2 critically re-evaluates a high-profile benchmark (GSM-Symbolic) that shaped narratives about LLM reasoning capabilities. By identifying statistical flaws, confounding variables (large number effects), and model-specific failure profiles, it challenges influential conclusions with rigorous methodology. This has broader impact across the AI/ML community by raising standards for benchmark evaluation and nuancing the debate on LLM reasoning. Paper 1, while practically useful for service discovery, addresses a more niche infrastructure problem with narrower audience. Paper 2's methodological contributions (proper statistical testing of benchmarks) are more widely applicable and timely.

vs. CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

gemini-3.1 · 5/29/2026

Paper 1 addresses a foundational challenge in the emerging 'Internet of Agents' paradigm, offering a scalable solution for service discovery that overcomes fundamental LLM context limits. While Paper 2 provides significant architectural efficiency gains for VLMs, Paper 1's introduction of an LLM-native hierarchical taxonomy has broader potential to shape future multi-agent architectures, API ecosystems, and tool-use methodologies, making it more conceptually innovative and impactful for the next generation of AI systems.

vs. The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

gemini-3.1 · 5/29/2026

Paper 1 exposes a fundamental theoretical flaw in the reasoning mechanisms of masked diffusion models, offering deep scientific insights into how decoding strategies affect logical-flow trajectories. While Paper 2 presents a highly practical engineering solution for LLM context management, Paper 1's contribution to understanding and correcting core architectural and training paradigms has a more profound, lasting impact on the foundational science of generative AI.

vs. Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction

gpt-5.2 · 5/29/2026

Paper 2 has higher likely impact: it advances neuro-symbolic QA by improving knowledge graph reliability through ontology-grounded post-extraction correction, enabling SQL/SPARQL-like operations critical for complex, multi-hop, and aggregation questions. The approach is broadly applicable across domains that need consistent structured knowledge (IR, QA, semantic web, data integration) and is timely amid interest in trustworthy RAG. Paper 1 is valuable for agent service discovery and context management, but its scope is narrower and more systems-oriented, with less cross-field methodological generality.