Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition
Jinnuo Liu, Yue Peng, Jinhan Niu, Hongyi Wen
Abstract
Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordinate signatures, module paths, input-output contracts, semantics, and executable usage patterns. Existing novel-API benchmarks are typically static, rely on coarse pass/fail metrics, or use synthetic APIs that may not reflect real library evolution. We introduce NovelAPIBench, a fully automated dynamic benchmark that, for any base model and target library, discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories. Across about 1.9K tasks, four base models, and five domains, we compare knowledge injected through retrieval with knowledge internalized through parametric adaptation. We find that knowledge components are not interchangeable: usage examples are the strongest standalone signal, while the best two-component setting pairs signatures with either mechanisms or examples depending on the domain and backbone. Adding more context, especially source code, can hurt by increasing import-path errors. Parametric adaptation also does not replace retrieval once external knowledge is removed; rather, fine-tuning mainly teaches models how to use provided bundles, and this ability transfers to held-out libraries. These results suggest that retrieval and tuning play complementary roles: retrieval supplies volatile API content, while tuning improves procedural integration.
AI Impact Assessments
Scientific Impact Assessment: NovelAPIBench
1. Core Contribution
NovelAPIBench addresses a genuine and timely problem: how do code LLMs handle APIs absent from their pretraining data? The paper's main contribution is a four-stage automated pipeline that (1) discovers genuinely novel APIs by diffing library versions against a model's knowledge cutoff, (2) decomposes API knowledge into structured bundles (surface signatures, examples, mechanism prose, source code), (3) generates executable coding tasks with difficulty-graded test harnesses, and (4) applies model-conditional filtering to ensure tasks are genuinely novel to the target model.
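The version-diffing idea in stage (1) can be illustrated with a minimal sketch. It assumes each library version has already been reduced to a set of fully qualified public symbol names. The function name, the symbol names, and the dates are hypothetical and are not taken from the paper's pipeline.

```python
from datetime import date


def discover_novel_apis(old_symbols, new_symbols, release_date, knowledge_cutoff):
    """Return symbols present only in the newer version.

    Assumption: a symbol counts as novel only if the newer version was
    released after the model's knowledge cutoff.
    """
    if release_date <= knowledge_cutoff:
        return set()
    return set(new_symbols) - set(old_symbols)


# Hypothetical public API surfaces of two versions of an imaginary library.
old = {"lib.load", "lib.save"}
new = {"lib.load", "lib.save", "lib.stream_load"}

novel = discover_novel_apis(old, new, date(2025, 3, 1), date(2024, 6, 1))
stale = discover_novel_apis(old, new, date(2024, 1, 1), date(2024, 6, 1))
print(sorted(novel), sorted(stale))
```

A temporal diff like this only proposes candidates. The review notes that the benchmark also checks empirically that the base model cannot solve the resulting tasks unaided.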
The paper makes three distinct contributions: the benchmark infrastructure itself, an automated 6-class failure taxonomy that diagnoses *why* models fail (not just whether they fail), and a systematic empirical study comparing knowledge injection via retrieval versus parametric adaptation across ~1.9K tasks, four backbone models, and five domains.
2. Methodological Rigor
The experimental design is notably thorough. The benchmark pipeline has multiple quality gates—temporal cutoff for novelty, empirical verification that the base model cannot solve tasks unaided (C2), and strong-model solvability checks (C3). The "execute-then-assert" harness strategy with auto-injected API spies is a clever mechanism to prevent solutions that bypass the target API.
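One way to picture an "execute-then-assert" harness with an API spy is the sketch below, which uses Python's standard `unittest.mock`. The stand-in library `fakelib`, its `rolling_sum` function, and the helper `run_with_spy` are hypothetical. The paper's actual harness is not shown in this review and may differ.

```python
import types
from unittest import mock

# Hypothetical stand-in for a library exposing a novel API.
fakelib = types.SimpleNamespace(
    rolling_sum=lambda xs, k: [sum(xs[i:i + k]) for i in range(len(xs) - k + 1)]
)


def run_with_spy(solution, *args):
    """Execute the candidate first, then report its result and how often
    the target API was called while it ran."""
    with mock.patch.object(fakelib, "rolling_sum", wraps=fakelib.rolling_sum) as spy:
        result = solution(*args)
    return result, spy.call_count


def uses_api(xs):
    # Calls the target API, as the task intends.
    return fakelib.rolling_sum(xs, 2)


def bypasses_api(xs):
    # Produces the same output without touching the target API.
    return [a + b for a, b in zip(xs, xs[1:])]


good_result, good_calls = run_with_spy(uses_api, [1, 2, 3])
bad_result, bad_calls = run_with_spy(bypasses_api, [1, 2, 3])
print(good_result, good_calls, bad_result, bad_calls)
```

Both candidates return the same list, so an output-only assertion would accept both. Asserting that the call count is at least one is what rejects the bypassing solution.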
The knowledge decomposition is well-motivated: surface (S), exemplars (E), mechanism prose (M_prose), and source code (M_code) are ablated systematically across 9 primary conditions. The factorial design enables attribution of which knowledge components resolve which failure modes. The human evaluation of the failure taxonomy shows strong inter-annotator agreement (κ=0.876 at 6-class level, κ=0.928 at 4-class rollup), and human-LLM judge agreement is adequate (κ≈0.80-0.89).
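For readers unfamiliar with the agreement statistic, the following sketch computes Cohen's kappa for two annotators. The review does not say which kappa variant the paper uses, so Cohen's kappa is an assumption here, and the labels are made-up placeholders rather than the paper's failure categories.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical annotations over four failed samples.
annotator_1 = ["x", "x", "y", "y"]
annotator_2 = ["x", "x", "y", "x"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

In this toy case the annotators agree on three of four items (0.75 observed) against 0.5 expected by chance, which gives a kappa of 0.5. The reported values above 0.87 therefore indicate agreement well beyond chance.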
However, some limitations affect rigor: (1) the negative result for parametric adaptation is explicitly acknowledged as compute-bounded (LoRA rank 64, 3 epochs, ~1,500 training tasks), so strong claims about the impossibility of internalization are premature; (2) the dl domain was subsampled from 835 to 300 tasks for compute reasons, introducing potential bias; (3) the paper tests only ~8B-parameter models, leaving open whether the findings hold at larger scales.
3. Potential Impact
Benchmark design paradigm. The model-conditional, regenerable nature of NovelAPIBench is its most impactful design choice. Unlike static benchmarks that degrade as models absorb more training data, this benchmark can be regenerated per model, addressing a persistent problem in code generation evaluation. This could inspire similar dynamic benchmark designs in other domains.
Practical implications for RAG system design. The finding that usage examples are the strongest standalone component, while compact bundles (S+M_prose) often outperform richer contexts including source code, has direct implications for documentation design and retrieval system engineering. The insight that source code introduces import-path noise is counterintuitive and actionable.
Complementarity of retrieval and tuning. The finding that fine-tuning teaches a transferable "meta-skill" for using API bundles rather than memorizing API facts is theoretically interesting and practically relevant. The leave-one-out experiment showing that API selection transfers but module-path knowledge does not is a clean decomposition.
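A leave-one-out protocol over libraries can be sketched as below. The task records, the `library` field, and the library names are hypothetical. The paper's actual split procedure is not described in this review.

```python
def leave_one_library_out(tasks):
    """Yield (held_out, train, test) so that each library is held out once."""
    libraries = sorted({t["library"] for t in tasks})
    for held_out in libraries:
        train = [t for t in tasks if t["library"] != held_out]
        test = [t for t in tasks if t["library"] == held_out]
        yield held_out, train, test


# Hypothetical task records spanning three imaginary libraries.
tasks = [
    {"id": 1, "library": "alpha"},
    {"id": 2, "library": "alpha"},
    {"id": 3, "library": "beta"},
    {"id": 4, "library": "gamma"},
]

splits = list(leave_one_library_out(tasks))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Because no task from the held-out library appears in training, any gain on that library has to come from a general skill for using the provided bundle rather than from memorized facts about the library.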
Failure taxonomy as training signal. The six-class diagnostic taxonomy could serve as process-level supervision for future RL-based code generation training, though this is speculative.
4. Timeliness & Relevance
The paper addresses a genuine bottleneck: LLMs are increasingly used as coding agents but must navigate rapidly evolving API landscapes. The degradation of static benchmarks is a real concern—the paper directly tackles benchmark staleness. The comparison of RAG, SFT, RAFT, and knowledge editing methods (GRACE, MEMIT, AlphaEdit) under a unified framework is timely given the current proliferation of adaptation methods.
5. Strengths & Limitations
Strengths:
Limitations:
Notable observations:
Overall Assessment
This is a well-executed systems paper that makes meaningful contributions to code generation evaluation methodology. The dynamic benchmark design, diagnostic failure taxonomy, and systematic empirical findings about knowledge component roles represent genuine advances. The paper's primary value lies in its infrastructure and empirical insights rather than algorithmic novelty. The findings about retrieval-tuning complementarity, while not shocking, are rigorously demonstrated and practically useful. The work would benefit from testing at larger model scales and on multi-API compositional tasks.
Generated Jun 3, 2026
Comparison History (17)
Paper 2 addresses a fundamental flaw in how LLM reasoning is evaluated, revealing that widely used probabilistic confidence metrics capture surface-level fluency rather than true logical dependencies. This challenges a core assumption in the field. While Paper 1 offers a valuable dynamic benchmark for API tool use, Paper 2's findings on reasoning evaluation and its novel contrastive causality metric have broader, more foundational implications across the entire AI community, likely leading to higher scientific impact.
SciDER addresses the broader and more impactful problem of automating the entire scientific research lifecycle with a multi-agent system, spanning ideation, data analysis, experimentation, and critique. It releases open-source artifacts (dataset and model) that democratize access, and its breadth of impact across multiple scientific domains and benchmarks is substantial. While Paper 1 provides rigorous diagnostic insights into LLM API usage—a valuable but narrower contribution—Paper 2's potential to accelerate scientific discovery across fields, combined with its timeliness in the AI-for-science movement, gives it higher estimated impact.
EvoDS presents a more comprehensive and novel framework with broader impact. It introduces two innovative mechanisms (ASA and ACC) for autonomous data science agents, provides theoretical guarantees, and demonstrates strong empirical results (28.9% improvement across four benchmarks). The self-evolving agent paradigm with skill learning and adaptive context management addresses fundamental limitations in LLM-based automation with wide applicability. Paper 2, while valuable for understanding API knowledge gaps, is more narrowly focused on benchmarking novel API acquisition and provides primarily diagnostic insights rather than a transformative new capability.
Paper 1 likely has higher impact due to greater timeliness and cross-field breadth: robust LLM tool/API use is central to current AI deployment, software engineering, agents, and safety/reliability. NovelAPIBench’s automated dynamic construction (discovering novel APIs, generating executable tasks, and fine-grained diagnostic categorization) is methodologically innovative and broadly reusable across libraries and models, enabling standardized evaluation and driving practical improvements in retrieval + tuning strategies. Paper 2 is valuable and rigorous for single-cell multi-omics, but its domain scope is narrower and benchmarking modality translation, while important, is less universally transformative than LLM tool-use generalization.
Paper 1 demonstrates AI systems contributing to solving genuine open mathematical research problems, producing verified novel results (a phase diagram and a counterexample) on problems from a Simons Workshop collection. This represents a significant milestone in AI-assisted scientific discovery with broad implications across computational mathematics. Paper 2, while methodologically sound, addresses the more incremental question of how LLMs handle novel APIs—an important but narrower engineering contribution within the well-explored space of LLM code generation benchmarks. Paper 1's novelty in agentic research workflows for open problems has greater potential to reshape scientific practice.
Paper 1 introduces a comprehensive, automated benchmarking framework (NovelAPIBench) that addresses a fundamental challenge in LLM code generation—novel API acquisition—with rigorous methodology across ~1.9K tasks, multiple models, and domains. It provides nuanced diagnostic insights about knowledge component interactions and the complementary roles of retrieval vs. fine-tuning, which have broad implications for the LLM tooling ecosystem. Paper 2 addresses an important but narrower niche (hazard identification via multi-agent dialogue) with less methodological depth and more incremental contributions. Paper 1's broader applicability, deeper analysis, and actionable findings give it higher impact potential.
Paper 2 addresses a critical blind spot in LLM deployment: the reliability and uncertainty of compressed models. Since almost all practical LLM deployments rely on compression to reduce costs, highlighting the decoupling of accuracy and uncertainty has profound implications for safety-critical applications. This fundamental insight into model trustworthiness offers broader, more urgent relevance across the AI field than the specialized code-generation and tool-use focus of Paper 1.
Paper 2 presents a paradigm-shifting, planet-scale infrastructure for decentralized, AI-driven scientific discovery across multiple disciplines. While Paper 1 offers a rigorous and valuable benchmark for LLM API tool use, Paper 2's ambition to unify siloed scientific capabilities (wet labs, simulations, proof engines) into an emergent, self-organizing system has far broader implications. The real-world validations in complex physics and biology tasks demonstrate transformative potential, making its estimated scientific impact across diverse fields significantly higher.
Paper 1 introduces a comprehensive, reusable benchmark framework (NovelAPIBench) addressing a fundamental challenge in LLM tool use—novel API acquisition—with systematic diagnostic analysis across multiple dimensions. Its findings about complementary roles of retrieval and fine-tuning, and the decomposition of API knowledge into actionable components, have broad implications for code generation, tool-augmented LLMs, and continual learning. Paper 2 provides valuable insights into multi-agent debate dynamics with a useful theoretical condition, but addresses a narrower problem (data cleaning) with more incremental contributions. Paper 1's methodological infrastructure and generalizable insights likely yield broader and longer-lasting impact.
Paper 1 (NovelAPIBench) introduces a comprehensive, automated diagnostic benchmark framework that addresses a fundamental challenge in LLM tool use—novel API acquisition—with systematic decomposition of knowledge components and actionable insights about retrieval vs. parametric adaptation. Its findings about complementary roles of retrieval and fine-tuning have broad implications for LLM system design. Paper 2 (ToolGate) addresses the narrower problem of token-efficient tool call gating for VLM agents with a lightweight controller, yielding practical but incremental improvements. Paper 1's methodological contribution, diagnostic framework, and generalizable insights give it wider and deeper potential impact.
Paper 2 has higher estimated impact due to broader applicability beyond a single domain: diagnosing and improving LLM tool/API use affects code generation, agents, software engineering, and LLM alignment/reliability. Its automated pipeline (discovering novel APIs, generating executable tasks, and fine-grained failure diagnostics) is methodologically strong and scalable across libraries and models, enabling continuous evaluation as APIs evolve. The findings on retrieval vs. parametric adaptation and non-interchangeable knowledge components are actionable and timely for current agentic coding systems. Paper 1 is valuable but more domain-specific and constrained by clinical UI availability.
Paper 2 likely has higher scientific impact due to broader, timely relevance: robust LLM tool/API use affects many domains beyond a single clinical task. NovelAPIBench is a general, automated, dynamic benchmark applicable to arbitrary libraries and models, enabling standardized diagnosis of failure modes and informing both retrieval and fine-tuning strategies. Its methodological contribution (task generation, decomposed knowledge bundles, diagnostic taxonomy) supports reproducible evaluation and could influence model training, agent design, and software engineering research widely. Paper 1 is innovative and rigorous but is more specialized to radiology report generation and depends on embedding/reward design choices.
Paper 2 likely has higher impact due to stronger novelty and timeliness: a fully automated, dynamic benchmark for diagnosing LLM tool/API acquisition directly targets a major current bottleneck in agentic coding and deployment. Its diagnostic taxonomy, cross-model/domain evaluation, and actionable findings (non-interchangeable knowledge components; retrieval vs tuning complementarity) can influence both research and production practices across ML, software engineering, and evaluation. Paper 1 is solid and application-relevant for relational ML, but appears more incremental (task-head/masking/TF-IDF enhancements) with narrower cross-field reach.
TriLens introduces a novel, generalizable white-box method for hallucination detection that leverages per-layer entropy trajectories across internal model components. Hallucination detection is a critical, broadly applicable problem across all LLM applications. The method is elegant, lightweight (3L-dimensional), and provides mechanistic interpretability insights. Paper 2, while thorough in benchmarking API knowledge gaps, addresses a narrower problem (novel API usage in code generation) and is more of an empirical benchmark study. TriLens has broader impact potential across interpretability, safety, and deployment of LLMs.
While Paper 1 provides valuable diagnostic insights into LLM API use, Paper 2 tackles the next major frontier in AI: long-horizon, human-in-the-loop desktop agents operating specialized professional software. By formalizing realistic collaborative interactions and moving beyond short, simplified GUI tasks, DeskCraft addresses a critical bottleneck in deploying truly autonomous and cooperative AI assistants in real-world workflows.
Paper 1 addresses a pervasive bottleneck in LLM development: integrating novel APIs. Its dynamic benchmark and rigorous empirical analysis of retrieval-augmented generation versus fine-tuning provide actionable, high-utility insights for building autonomous agents. While Paper 2 tackles a vital ethical alignment issue, Paper 1's focus on tool use and knowledge acquisition promises broader and more immediate real-world applicability across industry and academia.
Paper 1 introduces a novel, systematic benchmark methodology for diagnosing LLM knowledge gaps in API usage with broader applicability across models and domains. Its decomposed diagnostic framework and findings about complementary roles of retrieval vs. fine-tuning offer fundamental insights for the LLM tool-use community. Paper 2, while showing strong empirical results on math benchmarks, is more narrowly focused on context engineering for mathematical reasoning with incremental improvements over existing methods. Paper 1's reusable benchmark infrastructure and generalizable insights about knowledge components give it wider potential impact across software engineering and AI research.