Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

Jinnuo Liu, Yue Peng, Jinhan Niu, Hongyi Wen

#1149 of 3355 · Artificial Intelligence
Tournament Score: 1438±44
Win Rate: 53% (9 wins, 8 losses, 17 matches)
Rating: 7.2/10

Abstract

Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordinate signatures, module paths, input-output contracts, semantics, and executable usage patterns. Existing novel-API benchmarks are typically static, rely on coarse pass/fail metrics, or use synthetic APIs that may not reflect real library evolution. We introduce NovelAPIBench, a fully automated dynamic benchmark that, for any base model and target library, discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories. Across about 1.9K tasks, four base models, and five domains, we compare knowledge injected through retrieval with knowledge internalized through parametric adaptation. We find that knowledge components are not interchangeable: usage examples are the strongest standalone signal, while the best two-component setting pairs signatures with either mechanisms or examples depending on the domain and backbone. Adding more context, especially source code, can hurt by increasing import-path errors. Parametric adaptation also does not replace retrieval once external knowledge is removed; rather, fine-tuning mainly teaches models how to use provided bundles, and this ability transfers to held-out libraries. These results suggest that retrieval and tuning play complementary roles: retrieval supplies volatile API content, while tuning improves procedural integration.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: NovelAPIBench

1. Core Contribution

NovelAPIBench addresses a genuine and timely problem: how do code LLMs handle APIs absent from their pretraining data? The paper's main contribution is a four-stage automated pipeline that (1) discovers genuinely novel APIs by diffing library versions against a model's knowledge cutoff, (2) decomposes API knowledge into structured bundles (surface signatures, examples, mechanism prose, source code), (3) generates executable coding tasks with difficulty-graded test harnesses, and (4) applies model-conditional filtering to ensure tasks are genuinely novel to the target model.
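The discovery stage described above can be illustrated with a minimal sketch. It assumes each library release is available as a list of exported symbols with textual signatures. The `ApiSymbol` record, the `discover_novel_apis` helper, and the choice to treat a changed signature as novel are hypothetical and used only for illustration; the paper's actual pipeline may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ApiSymbol:
    """A public symbol exported by a library release (hypothetical record type)."""
    qualified_name: str  # e.g. "pkg.module.func"
    signature: str       # textual signature as it appears in that release


def discover_novel_apis(old_release, new_release, new_release_date, model_cutoff):
    """Return symbols that are new or changed in new_release.

    Assumption: a symbol counts as novel only if the release that introduced
    it shipped after the model's knowledge cutoff.
    """
    if new_release_date <= model_cutoff:
        return []  # the model may already have seen this release
    old_by_name = {sym.qualified_name: sym for sym in old_release}
    novel = []
    for sym in new_release:
        previous = old_by_name.get(sym.qualified_name)
        # Novel if the symbol did not exist before or its signature changed.
        if previous is None or previous.signature != sym.signature:
            novel.append(sym)
    return sorted(novel, key=lambda sym: sym.qualified_name)
```

A version diff alone only yields candidates. According to the review, the benchmark additionally filters them per model, so a candidate that the base model can already solve unaided would be dropped in a later stage.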

The paper makes three distinct contributions: the benchmark infrastructure itself, an automated 6-class failure taxonomy that diagnoses *why* models fail (not just whether they fail), and a systematic empirical study comparing knowledge injection via retrieval versus parametric adaptation across ~1.9K tasks, four backbone models, and five domains.

2. Methodological Rigor

The experimental design is notably thorough. The benchmark pipeline has multiple quality gates—temporal cutoff for novelty, empirical verification that the base model cannot solve tasks unaided (C2), and strong-model solvability checks (C3). The "execute-then-assert" harness strategy with auto-injected API spies is a clever mechanism to prevent solutions that bypass the target API.
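One way to picture an "execute-then-assert" harness with an API spy is the sketch below. It is not the paper's implementation: `fake_lib`, `clip_sum`, and both solution functions are hypothetical stand-ins, and the spy is built with the standard library's `unittest.mock.patch.object` using `wraps` so the real function still runs.

```python
import types
from unittest import mock

# Hypothetical stand-in for a target library exposing one "novel" API.
fake_lib = types.SimpleNamespace(
    clip_sum=lambda xs, limit: min(sum(xs), limit)
)


def candidate_solution(lib, xs):
    """A model-written solution that calls the target API."""
    return lib.clip_sum(xs, limit=10)


def bypassing_solution(lib, xs):
    """Produces the same answer without touching the target API."""
    return min(sum(xs), 10)


def execute_then_assert(solution, lib, xs, expected):
    """Run the solution first, then check both its output and API usage."""
    # wraps= keeps the original behaviour while recording every call.
    with mock.patch.object(lib, "clip_sum", wraps=lib.clip_sum) as spy:
        result = solution(lib, xs)
    return result == expected and spy.called


print(execute_then_assert(candidate_solution, fake_lib, [4, 5, 6], 10))
print(execute_then_assert(bypassing_solution, fake_lib, [4, 5, 6], 10))
```

The first call passes because the output is correct and the spy recorded a call. The second fails despite a correct output, which is the bypass case the harness is meant to reject.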

The knowledge decomposition is well-motivated: surface (S), exemplars (E), mechanism prose (M_prose), and source code (M_code) are ablated systematically across 9 primary conditions. The factorial design enables attribution of which knowledge components resolve which failure modes. The human evaluation of the failure taxonomy shows strong inter-annotator agreement (κ=0.876 at 6-class level, κ=0.928 at 4-class rollup), and human-LLM judge agreement is adequate (κ≈0.80-0.89).
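For readers unfamiliar with the agreement statistic quoted above, Cohen's κ compares observed agreement with the agreement expected by chance from each annotator's label frequencies. The sketch below is a plain-Python illustration. The label names are hypothetical and are not the paper's six categories, and the numbers are toy data, not results from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Toy example with hypothetical failure labels.
annotator_1 = ["import_path", "import_path", "wrong_api", "wrong_api"]
annotator_2 = ["import_path", "import_path", "wrong_api", "import_path"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

In the toy example the annotators agree on 3 of 4 items (0.75) while chance agreement is 0.5, giving κ = (0.75 - 0.5) / (1 - 0.5) = 0.5.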

However, some limitations affect rigor: (1) the parametric adaptation negative result is explicitly acknowledged as compute-bounded—LoRA rank 64, 3 epochs, ~1,500 training tasks—so strong claims about the impossibility of internalization are premature; (2) the dl domain was subsampled from 835 to 300 tasks for compute reasons, introducing potential bias; (3) the paper tests only ~8B parameter models, leaving open whether findings hold at larger scales.

3. Potential Impact

Benchmark design paradigm. The model-conditional, regenerable nature of NovelAPIBench is its most impactful design choice. Unlike static benchmarks that degrade as models absorb more training data, this benchmark can be regenerated per model, addressing a persistent problem in code generation evaluation. This could inspire similar dynamic benchmark designs in other domains.

Practical implications for RAG system design. The finding that usage examples are the strongest standalone component, while compact bundles (S+M_prose) often outperform richer contexts including source code, has direct implications for documentation design and retrieval system engineering. The insight that source code introduces import-path noise is counterintuitive and actionable.

Complementarity of retrieval and tuning. The finding that fine-tuning teaches a transferable "meta-skill" for using API bundles rather than memorizing API facts is theoretically interesting and practically relevant. The leave-one-out experiment showing that API selection transfers but module-path knowledge does not is a clean decomposition.

Failure taxonomy as training signal. The six-class diagnostic taxonomy could serve as process-level supervision for future RL-based code generation training, though this is speculative.

4. Timeliness & Relevance

The paper addresses a genuine bottleneck: LLMs are increasingly used as coding agents but must navigate rapidly evolving API landscapes. The degradation of static benchmarks is a real concern—the paper directly tackles benchmark staleness. The comparison of RAG, SFT, RAFT, and knowledge editing methods (GRACE, MEMIT, AlphaEdit) under a unified framework is timely given the current proliferation of adaptation methods.

5. Strengths & Limitations

Strengths:

  • Dynamic, regenerable design solves the benchmark obsolescence problem
  • Fine-grained diagnostics go well beyond pass/fail, enabling mechanistic understanding
  • Comprehensive ablation across knowledge components, domains, difficulty levels, and backbones
  • Leave-one-out transfer experiments provide clean evidence about what fine-tuning actually learns
  • Exceptional documentation: appendices are thorough with full algorithms, human evaluation protocols, licensing details, and reproducibility information
  • The real-retriever vs. oracle comparison (Appendix E.4) reveals that deployment rankings can differ from ablation rankings—a practically important nuance

Limitations:

  • Python-only, single-API tasks—real agentic workflows chain multiple APIs
  • Scale-limited: only ~8B models tested; findings may not hold at 70B+
  • Compute-bounded negative result: the claim that parametric methods cannot internalize APIs is qualified but could mislead readers
  • GPT-5-mini dependency for task generation and failure classification introduces non-reproducibility and potential bias
  • The benchmark currently covers 19 libraries; broader coverage would strengthen generalizability claims
  • Some domain coverage is thin (flask: 1 task, pymatgen: 4 tasks)

Notable observations:

  • The finding that reasoning-oriented backbones (R1-Distill) resist source-induced import noise while others don't is intriguing but based on a single model
  • The paper's framing of "content acquisition" vs. "procedural realization" as distinct sub-skills is a useful conceptual contribution, though not entirely novel in the knowledge representation literature
  • The data-scaling probe (Appendix E.7) showing the held-out gap shrinks with data somewhat undermines the strong framing of the negative internalization result

Overall Assessment

    This is a well-executed systems paper that makes meaningful contributions to code generation evaluation methodology. The dynamic benchmark design, diagnostic failure taxonomy, and systematic empirical findings about knowledge component roles represent genuine advances. The paper's primary value lies in its infrastructure and empirical insights rather than algorithmic novelty. The findings about retrieval-tuning complementarity, while not shocking, are rigorously demonstrated and practically useful. The work would benefit from testing at larger model scales and on multi-API compositional tasks.

    Rating: 7.2/10
    Significance 7.5 · Rigor 7.5 · Novelty 6.8 · Clarity 8

    Generated Jun 3, 2026

    Comparison History (17)

    vs. Reasoning or Fluency? Dissecting Probabilistic Confidence in Best-of-N Selection
    gemini-3.1 · 6/5/2026

    Paper 2 addresses a fundamental flaw in how LLM reasoning is evaluated, revealing that widely used probabilistic confidence metrics capture surface-level fluency rather than true logical dependencies. This challenges a core assumption in the field. While Paper 1 offers a valuable dynamic benchmark for API tool use, Paper 2's findings on reasoning evaluation and its novel contrastive causality metric have broader, more foundational implications across the entire AI community, likely leading to higher scientific impact.

    vs. SciDER: Scientific Data-centric End-to-end Researcher
    claude-opus-4.6 · 6/5/2026

    SciDER addresses the broader and more impactful problem of automating the entire scientific research lifecycle with a multi-agent system, spanning ideation, data analysis, experimentation, and critique. It releases open-source artifacts (dataset and model) that democratize access, and its breadth of impact across multiple scientific domains and benchmarks is substantial. While Paper 1 provides rigorous diagnostic insights into LLM API usage—a valuable but narrower contribution—Paper 2's potential to accelerate scientific discovery across fields, combined with its timeliness in the AI-for-science movement, gives it higher estimated impact.

    vs. EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management
    claude-opus-4.6 · 6/3/2026

    EvoDS presents a more comprehensive and novel framework with broader impact. It introduces two innovative mechanisms (ASA and ACC) for autonomous data science agents, provides theoretical guarantees, and demonstrates strong empirical results (28.9% improvement across four benchmarks). The self-evolving agent paradigm with skill learning and adaptive context management addresses fundamental limitations in LLM-based automation with wide applicability. Paper 2, while valuable for understanding API knowledge gaps, is more narrowly focused on benchmarking novel API acquisition and provides primarily diagnostic insights rather than a transformative new capability.

    vs. scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation
    gpt-5.2 · 6/3/2026

    Paper 1 likely has higher impact due to greater timeliness and cross-field breadth: robust LLM tool/API use is central to current AI deployment, software engineering, agents, and safety/reliability. NovelAPIBench’s automated dynamic construction (discovering novel APIs, generating executable tasks, and fine-grained diagnostic categorization) is methodologically innovative and broadly reusable across libraries and models, enabling standardized evaluation and driving practical improvements in retrieval + tuning strategies. Paper 2 is valuable and rigorous for single-cell multi-omics, but its domain scope is narrower and benchmarking modality translation, while important, is less universally transformative than LLM tool-use generalization.

    vs. Iteris: Agentic Research Loops for Computational Mathematics
    claude-opus-4.6 · 6/3/2026

    Paper 1 demonstrates AI systems contributing to solving genuine open mathematical research problems, producing verified novel results (a phase diagram and a counterexample) on problems from a Simons Workshop collection. This represents a significant milestone in AI-assisted scientific discovery with broad implications across computational mathematics. Paper 2, while methodologically sound, addresses the more incremental question of how LLMs handle novel APIs—an important but narrower engineering contribution within the well-explored space of LLM code generation benchmarks. Paper 1's novelty in agentic research workflows for open problems has greater potential to reshape scientific practice.

    vs. Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis
    claude-opus-4.6 · 6/3/2026

    Paper 1 introduces a comprehensive, automated benchmarking framework (NovelAPIBench) that addresses a fundamental challenge in LLM code generation—novel API acquisition—with rigorous methodology across ~1.9K tasks, multiple models, and domains. It provides nuanced diagnostic insights about knowledge component interactions and the complementary roles of retrieval vs. fine-tuning, which have broad implications for the LLM tooling ecosystem. Paper 2 addresses an important but narrower niche (hazard identification via multi-agent dialogue) with less methodological depth and more incremental contributions. Paper 1's broader applicability, deeper analysis, and actionable findings give it higher impact potential.

    vs. Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction
    gemini-3.1 · 6/3/2026

    Paper 2 addresses a critical blind spot in LLM deployment: the reliability and uncertainty of compressed models. Since almost all practical LLM deployments rely on compression to reduce costs, highlighting the decoupling of accuracy and uncertainty has profound implications for safety-critical applications. This fundamental insight into model trustworthiness offers broader, more urgent relevance across the AI field than the specialized code-generation and tool-use focus of Paper 1.

    vs. Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery
    gemini-3.1 · 6/3/2026

    Paper 2 presents a paradigm-shifting, planet-scale infrastructure for decentralized, AI-driven scientific discovery across multiple disciplines. While Paper 1 offers a rigorous and valuable benchmark for LLM API tool use, Paper 2's ambition to unify siloed scientific capabilities (wet labs, simulations, proof engines) into an emergent, self-organizing system has far broader implications. The real-world validations in complex physics and biology tasks demonstrate transformative potential, making its estimated scientific impact across diverse fields significantly higher.

    vs. When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning
    claude-opus-4.6 · 6/3/2026

    Paper 1 introduces a comprehensive, reusable benchmark framework (NovelAPIBench) addressing a fundamental challenge in LLM tool use—novel API acquisition—with systematic diagnostic analysis across multiple dimensions. Its findings about complementary roles of retrieval and fine-tuning, and the decomposition of API knowledge into actionable components, have broad implications for code generation, tool-augmented LLMs, and continual learning. Paper 2 provides valuable insights into multi-agent debate dynamics with a useful theoretical condition, but addresses a narrower problem (data cleaning) with more incremental contributions. Paper 1's methodological infrastructure and generalizable insights likely yield broader and longer-lasting impact.

    vs. ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents
    claude-opus-4.6 · 6/3/2026

    Paper 1 (NovelAPIBench) introduces a comprehensive, automated diagnostic benchmark framework that addresses a fundamental challenge in LLM tool use—novel API acquisition—with systematic decomposition of knowledge components and actionable insights about retrieval vs. parametric adaptation. Its findings about complementary roles of retrieval and fine-tuning have broad implications for LLM system design. Paper 2 (ToolGate) addresses the narrower problem of token-efficient tool call gating for VLM agents with a lightweight controller, yielding practical but incremental improvements. Paper 1's methodological contribution, diagnostic framework, and generalizable insights give it wider and deeper potential impact.

    vs. MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
    gpt-5.2 · 6/3/2026

    Paper 2 has higher estimated impact due to broader applicability beyond a single domain: diagnosing and improving LLM tool/API use affects code generation, agents, software engineering, and LLM alignment/reliability. Its automated pipeline (discovering novel APIs, generating executable tasks, and fine-grained failure diagnostics) is methodologically strong and scalable across libraries and models, enabling continuous evaluation as APIs evolve. The findings on retrieval vs. parametric adaptation and non-interchangeable knowledge components are actionable and timely for current agentic coding systems. Paper 1 is valuable but more domain-specific and constrained by clinical UI availability.

    vs. SDR: Set-Distance Rewards for Radiology Report Generation
    gpt-5.2 · 6/3/2026

    Paper 2 likely has higher scientific impact due to broader, timely relevance: robust LLM tool/API use affects many domains beyond a single clinical task. NovelAPIBench is a general, automated, dynamic benchmark applicable to arbitrary libraries and models, enabling standardized diagnosis of failure modes and informing both retrieval and fine-tuning strategies. Its methodological contribution (task generation, decomposed knowledge bundles, diagnostic taxonomy) supports reproducible evaluation and could influence model training, agent design, and software engineering research widely. Paper 1 is innovative and rigorous but is more specialized to radiology report generation and depends on embedding/reward design choices.

    vs. RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases
    gpt-5.2 · 6/3/2026

    Paper 2 likely has higher impact due to stronger novelty and timeliness: a fully automated, dynamic benchmark for diagnosing LLM tool/API acquisition directly targets a major current bottleneck in agentic coding and deployment. Its diagnostic taxonomy, cross-model/domain evaluation, and actionable findings (non-interchangeable knowledge components; retrieval vs. tuning complementarity) can influence both research and production practices across ML, software engineering, and evaluation. Paper 1 is solid and application-relevant for relational ML, but appears more incremental (task-head, masking, and TF-IDF enhancements) with narrower cross-field reach.

    vs. TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection
    claude-opus-4.6 · 6/3/2026

    TriLens introduces a novel, generalizable white-box method for hallucination detection that leverages per-layer entropy trajectories across internal model components. Hallucination detection is a critical, broadly applicable problem across all LLM applications. The method is elegant, lightweight (3L-dimensional), and provides mechanistic interpretability insights. Paper 2, while thorough in benchmarking API knowledge gaps, addresses a narrower problem (novel API usage in code generation) and is more of an empirical benchmark study. TriLens has broader impact potential across interpretability, safety, and deployment of LLMs.

    vs. DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration
    gemini-3.1 · 6/3/2026

    While Paper 1 provides valuable diagnostic insights into LLM API use, Paper 2 tackles the next major frontier in AI: long-horizon, human-in-the-loop desktop agents operating specialized professional software. By formalizing realistic collaborative interactions and moving beyond short, simplified GUI tasks, DeskCraft addresses a critical bottleneck in deploying truly autonomous and cooperative AI assistants in real-world workflows.

    vs. TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment
    gemini-3.1 · 6/3/2026

    Paper 1 addresses a pervasive bottleneck in LLM development: integrating novel APIs. Its dynamic benchmark and rigorous empirical analysis of retrieval-augmented generation versus fine-tuning provide actionable, high-utility insights for building autonomous agents. While Paper 2 tackles a vital ethical alignment issue, Paper 1's focus on tool use and knowledge acquisition promises broader and more immediate real-world applicability across industry and academia.

    vs. KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning
    claude-opus-4.6 · 6/3/2026

    Paper 1 introduces a novel, systematic benchmark methodology for diagnosing LLM knowledge gaps in API usage with broader applicability across models and domains. Its decomposed diagnostic framework and findings about complementary roles of retrieval vs. fine-tuning offer fundamental insights for the LLM tool-use community. Paper 2, while showing strong empirical results on math benchmarks, is more narrowly focused on context engineering for mathematical reasoning with incremental improvements over existing methods. Paper 1's reusable benchmark infrastructure and generalizable insights about knowledge components give it wider potential impact across software engineering and AI research.