Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs

Weiwei Ding, Zixuan Li, Long Bai, Zhuo Chen, Kun Su, Fei Wang, Xiaolong Jin, Jin Zhang

Jun 2, 2026

arXiv:2606.03705v1

cs.AI (primary)

#913 of 3355 · Artificial Intelligence

Tournament Score: 1451 ± 45 (scale 1050 to 1800)

Win Rate: 59%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.5

Abstract

Knowledge Graphs (KGs) are widely used to mitigate the limitations of Large Language Models (LLMs), such as outdated knowledge and hallucinations. Existing LLM-KG integration frameworks typically rely on predefined operators to retrieve factual knowledge from KGs and inject it into prompts for answer generation. This paradigm faces two critical bottlenecks: 1) Inflexibility: The predefined operators are limited in scope and thus lack sufficient compositional expressiveness to fully capture the complex semantics required by KG questions. 2) Unscalability: Direct injection of factual knowledge into prompts limits scalability in handling large-scale factual knowledge. To address these two bottlenecks, we propose Code-on-Graph (CoG), a programmatic reasoning framework for LLM-KG integration. Specifically, given the factual knowledge retrieved at each reasoning step, CoG first identifies the corresponding KG schemas and represents these schemas as Python classes, which serve as abstract interfaces to the retrieved facts. It then generates executable code grounded in these classes, with the retrieved facts instantiated as objects of the corresponding classes during execution. This design enables flexible code-based reasoning while avoiding the direct injection of large-scale factual knowledge into prompts. Experiments on WebQSP, CWQ, and GrailQA demonstrate that CoG outperforms prior state-of-the-art models by up to 10.5%.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Code-on-Graph (CoG)

1. Core Contribution

CoG addresses two well-identified bottlenecks in LLM-KG integration: inflexibility of predefined operators (which cannot express complex operations like ranking with offsets or nested filtering) and unscalability of injecting raw triples into prompts. The key insight is to abstract KG schemas into Python class definitions, generate task-specific executable code over these abstractions, and instantiate retrieved facts as objects at execution time. This effectively separates the schema-level reasoning (which enters the LLM context) from the bulk factual data (which is handled programmatically outside the context window).
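
The schema-to-class idea described above can be illustrated with a minimal Python sketch. The `Country` class, its fields, and the sample facts are hypothetical and not taken from the paper. The point is only that the short class definition is what would need to enter the prompt, while the facts (which may be numerous) are instantiated as objects at execution time. The generated-style function shows nested filtering followed by ranking with an offset, the kind of operation the assessment says fixed operators struggle to express.

```python
from dataclasses import dataclass


# Hypothetical schema class: only this definition would need to appear in the prompt.
@dataclass
class Country:
    name: str
    population: int
    continent: str


# Retrieved facts are instantiated outside the prompt.
# All values are made up for illustration.
facts = [
    Country("A", 50, "Europe"),
    Country("B", 80, "Europe"),
    Country("C", 65, "Asia"),
    Country("D", 30, "Europe"),
]


# Code of the kind an LLM might generate against the class interface:
# filter by continent, rank by population, then take the entry at offset 1.
def second_most_populous(countries, continent):
    in_continent = [c for c in countries if c.continent == continent]
    ranked = sorted(in_continent, key=lambda c: c.population, reverse=True)
    return ranked[1].name if len(ranked) > 1 else None


answer = second_most_populous(facts, "Europe")
```

Adding thousands more `Country` objects to `facts` would not change the size of the class definition, which is the scalability argument in miniature.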

The framework operates iteratively through Planning (dynamic subtask decomposition with an evaluator for adaptive termination), Coding (schema-to-class mapping and code generation), and Executing (sandboxed execution with self-correction loops). The approach is well-motivated: object-oriented abstractions are a natural fit for KG structures, and code generation provides Turing-complete expressiveness compared to fixed operator inventories.
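
A rough sketch of the Planning, Coding, and Executing loop with self-correction follows. Every function here is a hypothetical stub: `plan` and `generate_code` stand in for LLM calls, and the restricted `exec` namespace stands in for a real sandbox (it is not a secure one). The sketch only shows the control flow, in which an error trace from a failed execution is passed back to the coder for another attempt.

```python
import traceback


def plan(question, history):
    # Stub planner/evaluator: emit one subtask, then signal termination.
    return None if history else "count the items"


def generate_code(subtask, error=None):
    # Stub coder: the first attempt contains a deliberate bug;
    # the retry, which receives an error trace, is corrected.
    if error is None:
        return "result = len(itemz)"  # raises NameError on purpose
    return "result = len(items)"


def execute(code, env):
    # Run generated code in a separate namespace with limited builtins.
    # A real system would need a proper sandbox.
    scope = dict(env)
    exec(code, {"__builtins__": {"len": len}}, scope)
    return scope["result"]


def solve(question, env, max_retries=2):
    history = []
    while (subtask := plan(question, history)) is not None:
        error = None
        for _ in range(max_retries + 1):
            code = generate_code(subtask, error)
            try:
                history.append(execute(code, env))
                break
            except Exception:
                # The trace is fed back to the coder for self-correction.
                error = traceback.format_exc()
        else:
            history.append(None)  # retries exhausted for this subtask
    return history


results = solve("how many items are there?", {"items": [1, 2, 3]})
```

In this toy run the first generated snippet fails, the second succeeds, and the planner then terminates.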

2. Methodological Rigor

Strengths in experimental design:

Evaluation on three standard benchmarks (WebQSP, CWQ, GrailQA) with established metrics (Hits@1)

Comparison against both fine-tuning and prompting baselines, with multiple backbone LLMs (Qwen3-Coder-30B-A3B, DeepSeek-V3.2, GPT-4.1-mini) to control for model effects

Comprehensive ablation study isolating contributions of iterative planning, error correction, code reasoning, and JSON vs. class representation

Introduction of Token Utility Rate (TUR), a useful efficiency metric capturing facts-processed-per-token
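
The assessment describes TUR only as facts processed per token, so the function below is a sketch under that reading; the paper's exact definition may differ, and all numbers are hypothetical.

```python
def token_utility_rate(facts_processed: int, tokens_consumed: int) -> float:
    """Facts handled per token consumed (assumed reading of TUR)."""
    if tokens_consumed <= 0:
        raise ValueError("tokens_consumed must be positive")
    return facts_processed / tokens_consumed


# Hypothetical comparison at an equal token budget.
prompt_injection = token_utility_rate(facts_processed=16, tokens_consumed=1024)
code_based = token_utility_rate(facts_processed=640, tokens_consumed=1024)
ratio = code_based / prompt_injection
```

Under this reading, a method that touches more facts through executed code at the same token cost scores proportionally higher.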

Concerns:

The GrailQA evaluation uses only 1,000 of 6,763 test samples (following prior work, but still a limitation)

F1 scores (Table 5) show CoG underperforms RoG on WebQSP F1 (67.8 vs. 70.8), despite higher Hits@1, suggesting the method may be less precise on multi-answer questions

The retrieval component uses a simple DistilBERT-based similarity scorer with fixed depth-2 expansion and top-8 edges—this is a potential bottleneck that receives limited analysis

The paper lacks statistical significance tests or confidence intervals

Some baselines use different LLM backbones, making direct comparisons imperfect despite efforts to use multiple models
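
The gap between Hits@1 and F1 noted in the concerns above can be seen in a small sketch. The definitions below are common ones (top-ranked prediction is correct; set-based precision and recall) and are assumed here rather than taken from the paper. The answers are made up.

```python
def hits_at_1(predicted, gold):
    # 1.0 if the top-ranked prediction is one of the gold answers.
    return 1.0 if predicted and predicted[0] in gold else 0.0


def f1(predicted, gold):
    # Set-based F1 over all predicted answers.
    pred, gold = set(predicted), set(gold)
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# A multi-answer question: the first prediction is right,
# but most gold answers are missed and one prediction is wrong.
gold_answers = {"a", "b", "c", "d"}
predictions = ["a", "x"]

h1 = hits_at_1(predictions, gold_answers)
f1_score = f1(predictions, gold_answers)
```

Here Hits@1 is perfect while F1 is about one third, which is the pattern a method with strong top-1 accuracy but incomplete answer sets would show.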

3. Potential Impact

Direct applications: CoG's design principle—abstracting structured data into typed programming interfaces for LLM-based code generation—generalizes beyond KGs. It could be applied to relational databases, ontologies, or any structured knowledge source where schema-level abstractions exist. The "write your own tools" paradigm is a meaningful advancement over static toolkits.

Efficiency implications: The TUR analysis is particularly compelling. CoG processes 40-47× more factual units per token than PoG while maintaining comparable token budgets and runtime. This addresses a genuine scalability concern in real-world KG applications where subgraphs can be massive.

Broader influence: The work bridges program-aided reasoning (PAL, PoT) with KG reasoning, a combination that has been underexplored. It contributes to the growing literature on neuro-symbolic approaches and LLM-as-programmer paradigms.

4. Timeliness & Relevance

The paper is timely on multiple fronts:

LLM-KG integration is an active research area driven by the need to ground LLMs in factual knowledge

Code generation capabilities of modern LLMs have matured significantly, making programmatic reasoning practical

Context window limitations remain a real constraint, and schema-level abstraction is a pragmatic solution

The shift from fixed operator inventories to dynamic code synthesis reflects a broader trend in agentic AI

The work arrives at a natural inflection point where LLMs are capable enough at code generation to make this approach viable, as evidenced by the strong performance even with the smaller Qwen3-Coder model.

5. Strengths & Limitations

Key Strengths:

Elegant abstraction mechanism: The schema-to-class mapping is a clean design that simultaneously addresses flexibility and scalability

Strong empirical results: Up to 10.5% improvement over SOTA, with particularly impressive gains on GrailQA's compositional (+23.5% over Readi) and zero-shot (+14.0%) splits

Self-correction mechanism: The execution feedback loop with error traces is well-designed and shown to be critical (removing it drops CWQ by 11.6%)

Thorough analysis: Error taxonomy (Figure 6), correction analysis (Figures 7-9), and case studies provide interpretability

Practical efficiency: Reduced LLM calls (25.1→7.0 on CWQ) while handling vastly more facts

Notable Limitations:

Model dependency: Performance with Qwen3-Coder-30B-A3B is notably weaker (76.0 vs. 88.7 on WebQSP with DeepSeek-V3.2), suggesting the method requires strong coding LLMs

Limited to Freebase: All three datasets use Freebase; generalization to other KGs (Wikidata, domain-specific KGs) is untested

CVT handling: The error analysis reveals persistent difficulties with Compound Value Type nodes, a structural limitation

No entity linking evaluation: The paper assumes pre-linked entities, bypassing a significant practical challenge

Retrieval bottleneck: Fixed hyperparameters (depth=2, breadth=8) and simple similarity-based retrieval could limit performance on larger or more complex KG structures

Reproducibility concerns: While prompts are provided, the complex pipeline with multiple LLM calls and sandbox execution may be challenging to reproduce exactly
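
To make the retrieval hyperparameters concrete, here is a sketch of fixed-depth, fixed-breadth expansion. It is an illustration, not the paper's retriever: the scorer is a simple token-overlap stand-in for the DistilBERT-based similarity model, the breadth limit is applied per hop (whether the paper applies it per hop or per node is not stated here), and the triples are hypothetical.

```python
from collections import defaultdict


def score(question, relation):
    # Stand-in scorer: token overlap between question and relation name.
    q_tokens = set(question.lower().split())
    r_tokens = set(relation.lower().replace(".", " ").replace("_", " ").split())
    return len(q_tokens & r_tokens)


def retrieve(triples, seeds, question, depth=2, breadth=8):
    out_edges = defaultdict(list)
    for head, rel, tail in triples:
        out_edges[head].append((head, rel, tail))

    frontier, kept, seen = set(seeds), [], set()
    for _ in range(depth):
        # Collect unseen outgoing edges from the current frontier.
        candidates = [
            edge
            for node in sorted(frontier)
            for edge in out_edges.get(node, [])
            if edge not in seen
        ]
        # Keep only the `breadth` highest-scoring edges at this hop.
        candidates.sort(key=lambda e: score(question, e[1]), reverse=True)
        top = candidates[:breadth]
        kept.extend(top)
        seen.update(top)
        frontier = {tail for _, _, tail in top}
    return kept


# Hypothetical toy graph.
triples = [
    ("m.paris", "capital_of", "m.france"),
    ("m.paris", "population", "2M"),
    ("m.france", "official_language", "French"),
    ("m.france", "currency", "Euro"),
]
question = "what language is spoken in the country whose capital is paris"
path = retrieve(triples, ["m.paris"], question, depth=2, breadth=1)
```

With a fixed `depth`, any answer more than that many hops from the seed entity is unreachable, and with a small `breadth` a mis-scored relation at the first hop prunes the correct path, which is why the review flags these settings as a possible bottleneck.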

Additional Observations

The paper's framing of "writing your own tools" versus using predefined tools is a compelling narrative that connects to broader trends in autonomous agents. The JSON ablation showing only minor degradation suggests the core benefit comes from the code execution paradigm rather than specifically from class-based representation, which somewhat weakens the paper's central claim about Python classes being essential.

The error analysis revealing that 30-43% of errors stem from retrieval failures suggests that improvements to the retrieval component could yield substantial additional gains, representing low-hanging fruit for future work.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.5

Generated Jun 3, 2026

Comparison History (22)

vs. Towards a Science of AI Agent Reliability

gpt-5.2 · 6/5/2026

Paper 1 likely has higher scientific impact because it proposes a general, safety-inspired framework and twelve reliability metrics that can reshape how AI agents are evaluated across many tasks and domains. Its focus on consistency, robustness, predictability, and safety directly targets a timely, broadly relevant bottleneck for real-world deployment and could influence benchmarks, standards, and regulation. Paper 2 is a solid, innovative LLM–KG integration method with strong task gains, but its impact is more specialized to KG question answering and may be overtaken quickly in a fast-moving area.

vs. MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

claude-opus-4.6 · 6/5/2026

Paper 1 introduces a novel reward attribution method for multi-agent RL with LLMs that achieved first place in a major NeurIPS 2025 competition, demonstrating an 8B model can match or surpass GPT-5. This has broader impact across RL, multi-agent systems, and LLM training. The practical demonstration of competitive performance with dramatically smaller models is highly impactful. Paper 2, while solid, offers incremental improvements to KG-QA with a programmatic reasoning framework—a more narrow contribution in a well-explored area with less transformative potential.

vs. Optimal Transport-based Permutation-Invariant Bayesian Optimization of Offshore Wind Farm Layouts

claude-opus-4.6 · 6/5/2026

Paper 2 addresses a broadly impactful problem at the intersection of LLMs and knowledge graphs—two of the hottest areas in AI. Its programmatic reasoning framework (CoG) introduces a novel paradigm shift from predefined operators to code-based reasoning, with strong empirical results (up to 10.5% improvement) across multiple benchmarks. The approach has wide applicability across NLP, QA, and knowledge-intensive tasks. Paper 1, while methodologically sound and novel in applying optimal transport to Bayesian optimization, targets a narrower domain (wind farm layout optimization) with more limited cross-field impact.

vs. Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making

gpt-5.2 · 6/5/2026

Paper 2 has higher potential impact due to broader, timelier applicability: mechanism-grounded reasoning over scientific simulators targets high-stakes decision-making across many domains (engineering, climate, epidemiology, policy) and directly addresses transparency/auditability—key current concerns for AI deployment. Its schema for assumptions, dependencies, and execution traces plus constrained, evidence-grounded explanations suggests stronger methodological rigor and a clearer path to real-world adoption than KG QA gains. Paper 1 is novel and effective within KG question answering, but the scope and cross-field impact are narrower.

vs. What Makes Interaction Trajectories Effective for Training Terminal Agents?

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a fundamental and timely question in agent training—what makes training data effective—revealing a counterintuitive 'pedagogical paradox' with significant implications for the rapidly growing field of LLM-based code agents. Its contributions (Terminal-Lego pipeline, harness engineering concept, exceptional data efficiency findings) have broad impact across agent post-training research. Paper 2, while solid, offers an incremental improvement to LLM-KG integration for question answering, a more established and narrower problem space. Paper 1's insights about training data quality over teacher strength and environment-grounded supervision are more likely to reshape research practices broadly.

vs. Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs

gemini-3.1 · 6/3/2026

Paper 1 addresses highly critical and timely bottlenecks (inflexibility and scalability) in LLM-KG integration, a rapidly expanding area of AI research. By abstracting KG facts into executable code representations, it provides a highly scalable, practical solution with significant empirical gains. While Paper 2 offers a strong foundational contribution to causal inference, Paper 1's methodology is likely to see faster, broader adoption and immediate real-world applications across the pervasive LLM ecosystem.

vs. From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging

claude-opus-4.6 · 6/3/2026

Paper 2 introduces a novel framework (PDA with GAM) for aggregating weak supervision signals to improve strong LLMs, addressing the fundamental challenge of scarce high-quality training data. Its broader applicability across model training paradigms, the innovative geometric alignment merging method, and demonstrated gains on diverse benchmarks (knowledge reasoning and agentic search) suggest wider impact. Paper 1, while solid with strong results on KG-QA, addresses a more specific problem (LLM-KG integration) with incremental improvements. Paper 2's insights on weak-to-strong generalization and LoRA merging have broader implications for the LLM training community.

vs. Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact due to broader applicability and timeliness: programmatic LLM-on-KG reasoning addresses widely relevant issues (hallucination, scalability, compositional querying) across QA, information retrieval, databases, and agentic coding. The code-as-interface to KG schemas is a notable integration pattern that can generalize beyond QA tasks. It also reports a large empirical gain (up to 10.5%) on multiple established benchmarks. Paper 1 is innovative for multimodal RL credit assignment, but its impact may be narrower to RLVR/vision-language training regimes.

vs. An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

gpt-5.2 · 6/3/2026

Paper 2 is more novel and broadly impactful: it isolates and quantifies a production–evaluation gap in large reasoning models via a targeted dataset (VAIR) and supports mechanisms (confirmation bias) with multiple complementary analyses (human baseline, CoT analysis, linear probes, causal patching). This speaks directly to timely concerns about LLM reliability, verification, and safety, with implications across ML, cognitive science, alignment, and evaluation methodology. Paper 1 is a strong systems contribution for KGQA, but its impact is narrower (KG integration) and more incremental relative to existing tool/code-based reasoning paradigms.

vs. Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis

gpt-5.2 · 6/3/2026

Paper 2 (Code-on-Graph) appears more novel and broadly impactful: it introduces a general, scalable LLM–knowledge graph integration paradigm using schema-induced Python classes and executable code, improving compositionality and avoiding prompt bloat. It targets widely relevant tasks (KGQA, factual reasoning) with strong benchmarks (WebQSP, CWQ, GrailQA) and sizable reported gains (up to 10.5%), suggesting methodological rigor and clear progress over SOTA. Paper 1 is timely for safety engineering, but its impact may be narrower and more dependent on dataset/metric validity and domain-specific deployment constraints.

vs. ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning

gemini-3.1 · 6/3/2026

While Paper 2 presents a strong methodological advancement in LLM-KG reasoning, Paper 1 tackles a critical bottleneck in a highly impactful domain: healthcare. By successfully bridging predictive Electronic Health Record (EHR) foundation models with the interpretable reasoning of LLMs, ChatHealthAI directly addresses the crucial need for explainable clinical decision support systems. Its potential to improve real-world patient outcomes and its relevance to the rapidly growing field of medical AI give it a higher potential for broad scientific and societal impact.

vs. AURA: Action-Gated Memory for Robot Policies at Constant VRAM

gpt-5.2 · 6/3/2026

Paper 1 targets a timely, under-addressed bottleneck for long-horizon embodied agents: memory bandwidth/endurance on edge hardware. Its action-gated constant-memory design is novel relative to KV-cache and reconstruction-based memories, and it reports concrete system-level gains (constant 4,224B state; large write reductions) with closed-loop robot-policy evaluation, suggesting strong real-world applicability in robotics/AR/edge autonomy. Paper 2 is useful and likely impactful in LLM+KG QA, but programmatic reasoning/code generation over schemas is closer to existing tool/code-based LLM paradigms and its gains are incremental within a narrower application slice.

vs. Subliminal Learning Is Steering Vector Distillation

gemini-3.1 · 6/3/2026

Paper 2 offers foundational insights into the mechanisms of LLM fine-tuning and alignment, uncovering how subliminal learning is driven by steering vector distillation. While Paper 1 presents a strong, practical framework for LLM-KG integration, Paper 2 addresses a fundamental, counter-intuitive phenomenon in deep learning. Its mechanistic explanation of how non-semantic data transfers semantic traits has profound implications for AI safety, interpretability, and alignment, giving it a higher potential for broad, long-lasting scientific impact across the theoretical and applied AI communities.

vs. InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain

gemini-3.1 · 6/3/2026

Paper 1 presents a highly innovative approach to LLM-KG integration by abstracting KG schemas into Python classes and utilizing code generation for reasoning. This addresses critical bottlenecks of inflexibility and context-window scalability in traditional RAG systems. Its substantial performance gains (up to 10.5%) on standard benchmarks and the broad applicability of bridging LLMs, code execution, and structured data suggest a higher potential for real-world impact and methodological adoption compared to the specialized RL reward shaping in Paper 2.

vs. Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

claude-opus-4.6 · 6/3/2026

Paper 1 (Code-on-Graph) addresses fundamental limitations of LLM-KG integration with a novel programmatic reasoning framework that demonstrates strong empirical results (up to 10.5% improvement) on established benchmarks. Its approach of representing KG schemas as Python classes for code-based reasoning is innovative and broadly applicable. Paper 2 tackles an important but narrower problem (instruction following constraints) with a graph-based approach. While useful, Paper 1 has greater breadth of impact, stronger methodological novelty in bridging code generation with KG reasoning, and addresses a more foundational challenge in the LLM ecosystem.

vs. Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a highly practical and timely problem—integrating LLMs with Knowledge Graphs—offering a novel programmatic reasoning framework (CoG) with strong empirical results (up to 10.5% improvement over SOTA). It has broad applicability across NLP, question answering, and AI systems. Paper 2, while theoretically rigorous in extending non-monotonic reasoning to defeasible standpoint logic, addresses a niche area in formal logic with a narrower audience and fewer immediate real-world applications. The timeliness and breadth of impact favor Paper 1 significantly.

vs. Toward a Modular Architecture for Embedded AI Agent Systems at the Edge

gemini-3.1 · 6/3/2026

Paper 1 demonstrates higher scientific impact due to its strong methodological rigor and concrete empirical results. While Paper 2 presents a timely theoretical architecture for edge AI, it explicitly lacks empirical benchmarks. In contrast, Paper 1 introduces a novel programmatic reasoning framework for LLM-KG integration that solves critical scalability bottlenecks. By validating its approach on standard datasets and achieving up to a 10.5% improvement over state-of-the-art models, Paper 1 offers proven, immediate utility and broad applicability in the highly active research area of LLM reasoning and retrieval-augmented generation.

vs. scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation

claude-opus-4.6 · 6/3/2026

Paper 2 (Code-on-Graph) addresses a fundamental challenge in LLM-KG integration with a novel programmatic reasoning framework that demonstrates significant performance improvements (up to 10.5%) on established benchmarks. It tackles the broadly impactful problem of LLM hallucination and knowledge limitations, which is highly timely given the widespread adoption of LLMs. Paper 1 (scTranslation) provides a valuable benchmark for single-cell multi-omics translation but is more incremental as a benchmarking study rather than introducing a fundamentally new method. Paper 2's broader applicability across AI/NLP gives it higher potential impact.

vs. TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

claude-opus-4.6 · 6/3/2026

Paper 1 presents a novel programmatic reasoning framework (Code-on-Graph) for LLM-KG integration that addresses fundamental limitations of existing approaches with strong empirical results (up to 10.5% improvement over SOTA). It introduces innovative technical contributions—representing KG schemas as Python classes and using code generation for reasoning—with broad applicability across knowledge-intensive NLP tasks. Paper 2, while practically useful, is primarily an engineering contribution combining existing evaluation dimensions into a resource-efficient pipeline without significant methodological novelty. Paper 1 has greater potential to influence future research directions in knowledge-grounded reasoning.

vs. StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

gpt-5.2 · 6/3/2026

Paper 1 is likely to have higher scientific impact due to broader relevance and novelty: it proposes a general LLM–knowledge graph integration paradigm (schema-to-code, executable reasoning) that addresses scalability and compositionality limits of prompt-injection retrieval, with strong gains across multiple standard KGQA benchmarks. This could influence LLM tool-use, neuro-symbolic reasoning, and retrieval-augmented systems beyond QA. Paper 2 is timely and practically valuable for multi-agent reliability, but is narrower (failure attribution on a specific benchmark) and more incremental in methodology (feature encoding + temporal/attention modeling), likely yielding more limited cross-field impact.