Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs
Weiwei Ding, Zixuan Li, Long Bai, Zhuo Chen, Kun Su, Fei Wang, Xiaolong Jin, Jin Zhang
Abstract
Knowledge Graphs (KGs) are widely used to mitigate the limitations of Large Language Models (LLMs), such as outdated knowledge and hallucinations. Existing LLM-KG integration frameworks typically rely on predefined operators to retrieve factual knowledge from KGs and inject it into prompts for answer generation. This paradigm faces two critical bottlenecks: 1) Inflexibility: The predefined operators are limited in scope and thus lack sufficient compositional expressiveness to fully capture the complex semantics required by KG questions. 2) Unscalability: Direct injection of factual knowledge into prompts limits scalability in handling large-scale factual knowledge. To address these two bottlenecks, we propose Code-on-Graph (CoG), a programmatic reasoning framework for LLM-KG integration. Specifically, given the factual knowledge retrieved at each reasoning step, CoG first identifies the corresponding KG schemas and represents these schemas as Python classes, which serve as abstract interfaces to the retrieved facts. It then generates executable code grounded in these classes, with the retrieved facts instantiated as objects of the corresponding classes during execution. This design enables flexible code-based reasoning while avoiding the direct injection of large-scale factual knowledge into prompts. Experiments on WebQSP, CWQ, and GrailQA demonstrate that CoG outperforms prior state-of-the-art models by up to 10.5%.
AI Impact Assessments
Scientific Impact Assessment: Code-on-Graph (CoG)
1. Core Contribution
CoG addresses two well-identified bottlenecks in LLM-KG integration: inflexibility of predefined operators (which cannot express complex operations like ranking with offsets or nested filtering) and unscalability of injecting raw triples into prompts. The key insight is to abstract KG schemas into Python class definitions, generate task-specific executable code over these abstractions, and instantiate retrieved facts as objects at execution time. This effectively separates the schema-level reasoning (which enters the LLM context) from the bulk factual data (which is handled programmatically outside the context window).
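To make the schema-to-class idea concrete, here is a minimal sketch, not the paper's actual implementation. The `Film` class, its fields, the sample facts, and the helper `nth_earliest_film` are all hypothetical names chosen for illustration. Only the class definition would need to appear in the LLM context; the fact list is instantiated as objects at execution time.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical schema class: stands in for a KG schema rendered as a Python class.
@dataclass(frozen=True)
class Film:
    title: str
    release_year: int
    director: str


# Hypothetical retrieved facts. These stay outside the prompt and are only
# turned into objects when the generated code runs.
retrieved_facts = [
    {"title": "Film One", "release_year": 2001, "director": "Dir A"},
    {"title": "Film Two", "release_year": 1995, "director": "Dir A"},
    {"title": "Film Three", "release_year": 2010, "director": "Dir A"},
    {"title": "Film Four", "release_year": 1990, "director": "Dir B"},
]
films = [Film(**fact) for fact in retrieved_facts]


def nth_earliest_film(items: List[Film], director: str, offset: int) -> Optional[Film]:
    """Filter by director, rank by release year, then pick by offset."""
    matching = [f for f in items if f.director == director]
    ranked = sorted(matching, key=lambda f: f.release_year)
    return ranked[offset] if 0 <= offset < len(ranked) else None


# "Second-earliest film by Dir A": ranking with an offset of 1.
second = nth_earliest_film(films, "Dir A", 1)
```

The filter-then-rank-then-offset step is the kind of composition the review says a fixed operator inventory struggles to express, while the number of facts can grow without changing the prompt.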
The framework operates iteratively through Planning (dynamic subtask decomposition with an evaluator for adaptive termination), Coding (schema-to-class mapping and code generation), and Executing (sandboxed execution with self-correction loops). The approach is well-motivated: object-oriented abstractions are a natural fit for KG structures, and code generation provides Turing-complete expressiveness compared to fixed operator inventories.
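The iterative loop described above can be sketched roughly as follows. This is an illustrative skeleton under stated assumptions, not the authors' code: `solve`, `execute_with_retries`, and the three stub callables are hypothetical stand-ins for the LLM-backed planner, evaluator, and code generator, and plain `exec` is used in place of a real sandbox.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def execute_with_retries(
    generate_code: Callable[[str, Optional[str]], str],
    subtask: str,
    env: Dict[str, Any],
    max_attempts: int = 3,
) -> Tuple[bool, Any]:
    """Run generated code; on failure, feed the error back for another attempt."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        source = generate_code(subtask, error)
        scope = dict(env)  # fresh copy so a failed attempt leaves no state behind
        try:
            exec(source, scope)  # assumption: a real system would sandbox this
            return True, scope.get("result")
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
    return False, error


def solve(
    question: str,
    plan_next: Callable[[str, List[Tuple[str, Any]]], str],
    is_done: Callable[[str, List[Tuple[str, Any]]], bool],
    generate_code: Callable[[str, Optional[str]], str],
    env: Dict[str, Any],
    max_steps: int = 5,
) -> List[Tuple[str, Any]]:
    """Plan a subtask, generate and run code for it, then let the evaluator decide."""
    history: List[Tuple[str, Any]] = []
    for _ in range(max_steps):
        subtask = plan_next(question, history)
        ok, outcome = execute_with_retries(generate_code, subtask, env)
        history.append((subtask, outcome))
        if ok and is_done(question, history):
            break
    return history


# Hypothetical stubs standing in for LLM calls.
def stub_plan(question: str, history: List[Tuple[str, Any]]) -> str:
    return f"step {len(history) + 1}"


def stub_is_done(question: str, history: List[Tuple[str, Any]]) -> bool:
    return len(history) >= 2


def stub_generate(subtask: str, error: Optional[str]) -> str:
    # The first attempt references an undefined name; the retry is corrected.
    if error is None:
        return "result = sum(values) / count"
    return "result = sum(values) / len(values)"


history = solve("average?", stub_plan, stub_is_done, stub_generate, {"values": [2, 4, 6]})
```

Here each subtask fails once with a `NameError` and succeeds on the retry, which mirrors the self-correction loop in miniature; the evaluator stub ends the run after two subtasks.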
2. Methodological Rigor
Strengths in experimental design:
Concerns:
3. Potential Impact
Direct applications: CoG's design principle—abstracting structured data into typed programming interfaces for LLM-based code generation—generalizes beyond KGs. It could be applied to relational databases, ontologies, or any structured knowledge source where schema-level abstractions exist. The "write your own tools" paradigm is a meaningful advancement over static toolkits.
Efficiency implications: The TUR analysis is particularly compelling. CoG processes 40-47× more factual units per token than PoG while maintaining comparable token budgets and runtime. This addresses a genuine scalability concern in real-world KG applications where subgraphs can be massive.
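As a rough illustration of what a "factual units per token" comparison measures, the sketch below assumes the metric is simply the number of facts processed divided by the tokens consumed, which is how the sentence above reads; the paper's exact definition may differ. The function name and all counts are made up for illustration and are not results from the paper.

```python
def facts_per_token(num_facts: int, num_tokens: int) -> float:
    """Facts processed per token consumed (assumed reading of the metric)."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return num_facts / num_tokens


# Hypothetical counts: both methods spend the same token budget, but the
# code-based method touches far more facts because they never enter the prompt.
prompt_injection = facts_per_token(num_facts=200, num_tokens=4000)
code_based = facts_per_token(num_facts=8000, num_tokens=4000)

# Relative advantage of the code-based method under these made-up counts.
relative_gain = code_based / prompt_injection
```

With equal token budgets, the ratio reduces to the ratio of facts processed, which is why keeping facts out of the prompt translates directly into a higher value.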
Broader influence: The work bridges program-aided reasoning (PAL, PoT) with KG reasoning, a combination that has been underexplored. It contributes to the growing literature on neuro-symbolic approaches and LLM-as-programmer paradigms.
4. Timeliness & Relevance
The paper is timely on multiple fronts:
The work arrives at a natural inflection point where LLMs are capable enough at code generation to make this approach viable, as evidenced by the strong performance even with the smaller Qwen3-Coder model.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper's framing of "writing your own tools" versus using predefined tools is a compelling narrative that connects to broader trends in autonomous agents. The JSON ablation showing only minor degradation suggests the core benefit comes from the code execution paradigm rather than specifically from class-based representation, which somewhat weakens the paper's central claim about Python classes being essential.
The error analysis revealing that 30-43% of errors stem from retrieval failures suggests that improvements to the retrieval component could yield substantial additional gains, representing low-hanging fruit for future work.
Generated Jun 3, 2026
Comparison History (22)
Paper 1 likely has higher scientific impact because it proposes a general, safety-inspired framework and twelve reliability metrics that can reshape how AI agents are evaluated across many tasks and domains. Its focus on consistency, robustness, predictability, and safety directly targets a timely, broadly relevant bottleneck for real-world deployment and could influence benchmarks, standards, and regulation. Paper 2 is a solid, innovative LLM–KG integration method with strong task gains, but its impact is more specialized to KG question answering and may be overtaken quickly in a fast-moving area.
Paper 1 introduces a novel reward attribution method for multi-agent RL with LLMs that achieved first place in a major NeurIPS 2025 competition, demonstrating that an 8B model can match or surpass GPT-5. This has broader impact across RL, multi-agent systems, and LLM training. The practical demonstration of competitive performance with dramatically smaller models is highly impactful. Paper 2, while solid, offers incremental improvements to KG-QA with a programmatic reasoning framework, a narrower contribution in a well-explored area with less transformative potential.
Paper 2 addresses a broadly impactful problem at the intersection of LLMs and knowledge graphs—two of the hottest areas in AI. Its programmatic reasoning framework (CoG) introduces a novel paradigm shift from predefined operators to code-based reasoning, with strong empirical results (up to 10.5% improvement) across multiple benchmarks. The approach has wide applicability across NLP, QA, and knowledge-intensive tasks. Paper 1, while methodologically sound and novel in applying optimal transport to Bayesian optimization, targets a narrower domain (wind farm layout optimization) with more limited cross-field impact.
Paper 2 has higher potential impact due to broader, timelier applicability: mechanism-grounded reasoning over scientific simulators targets high-stakes decision-making across many domains (engineering, climate, epidemiology, policy) and directly addresses transparency/auditability—key current concerns for AI deployment. Its schema for assumptions, dependencies, and execution traces plus constrained, evidence-grounded explanations suggests stronger methodological rigor and a clearer path to real-world adoption than KG QA gains. Paper 1 is novel and effective within KG question answering, but the scope and cross-field impact are narrower.
Paper 1 addresses a fundamental and timely question in agent training—what makes training data effective—revealing a counterintuitive 'pedagogical paradox' with significant implications for the rapidly growing field of LLM-based code agents. Its contributions (Terminal-Lego pipeline, harness engineering concept, exceptional data efficiency findings) have broad impact across agent post-training research. Paper 2, while solid, offers an incremental improvement to LLM-KG integration for question answering, a more established and narrower problem space. Paper 1's insights about training data quality over teacher strength and environment-grounded supervision are more likely to reshape research practices broadly.
Paper 1 addresses highly critical and timely bottlenecks (inflexibility and scalability) in LLM-KG integration, a rapidly expanding area of AI research. By abstracting KG facts into executable code representations, it provides a highly scalable, practical solution with significant empirical gains. While Paper 2 offers a strong foundational contribution to causal inference, Paper 1's methodology is likely to see faster, broader adoption and immediate real-world applications across the pervasive LLM ecosystem.
Paper 2 introduces a novel framework (PDA with GAM) for aggregating weak supervision signals to improve strong LLMs, addressing the fundamental challenge of scarce high-quality training data. Its broader applicability across model training paradigms, the innovative geometric alignment merging method, and demonstrated gains on diverse benchmarks (knowledge reasoning and agentic search) suggest wider impact. Paper 1, while solid with strong results on KG-QA, addresses a more specific problem (LLM-KG integration) with incremental improvements. Paper 2's insights on weak-to-strong generalization and LoRA merging have broader implications for the LLM training community.
Paper 2 likely has higher impact due to broader applicability and timeliness: programmatic LLM-on-KG reasoning addresses widely relevant issues (hallucination, scalability, compositional querying) across QA, information retrieval, databases, and agentic coding. The code-as-interface to KG schemas is a notable integration pattern that can generalize beyond QA tasks. It also reports a large empirical gain (up to 10.5%) on multiple established benchmarks. Paper 1 is innovative for multimodal RL credit assignment, but its impact may be narrower to RLVR/vision-language training regimes.
Paper 2 is more novel and broadly impactful: it isolates and quantifies a production–evaluation gap in large reasoning models via a targeted dataset (VAIR) and supports mechanisms (confirmation bias) with multiple complementary analyses (human baseline, CoT analysis, linear probes, causal patching). This speaks directly to timely concerns about LLM reliability, verification, and safety, with implications across ML, cognitive science, alignment, and evaluation methodology. Paper 1 is a strong systems contribution for KGQA, but its impact is narrower (KG integration) and more incremental relative to existing tool/code-based reasoning paradigms.
Paper 2 (Code-on-Graph) appears more novel and broadly impactful: it introduces a general, scalable LLM–knowledge graph integration paradigm using schema-induced Python classes and executable code, improving compositionality and avoiding prompt bloat. It targets widely relevant tasks (KGQA, factual reasoning) with strong benchmarks (WebQSP, CWQ, GrailQA) and sizable reported gains (up to 10.5%), suggesting methodological rigor and clear progress over SOTA. Paper 1 is timely for safety engineering, but its impact may be narrower and more dependent on dataset/metric validity and domain-specific deployment constraints.
While Paper 2 presents a strong methodological advancement in LLM-KG reasoning, Paper 1 tackles a critical bottleneck in a highly impactful domain: healthcare. By successfully bridging predictive Electronic Health Record (EHR) foundation models with the interpretable reasoning of LLMs, ChatHealthAI directly addresses the crucial need for explainable clinical decision support systems. Its potential to improve real-world patient outcomes and its relevance to the rapidly growing field of medical AI give it a higher potential for broad scientific and societal impact.
Paper 1 targets a timely, under-addressed bottleneck for long-horizon embodied agents: memory bandwidth/endurance on edge hardware. Its action-gated constant-memory design is novel relative to KV-cache and reconstruction-based memories, and it reports concrete system-level gains (constant 4,224B state; large write reductions) with closed-loop robot-policy evaluation, suggesting strong real-world applicability in robotics/AR/edge autonomy. Paper 2 is useful and likely impactful in LLM+KG QA, but programmatic reasoning/code generation over schemas is closer to existing tool/code-based LLM paradigms and its gains are incremental within a narrower application slice.
Paper 2 offers foundational insights into the mechanisms of LLM fine-tuning and alignment, uncovering how subliminal learning is driven by steering vector distillation. While Paper 1 presents a strong, practical framework for LLM-KG integration, Paper 2 addresses a fundamental, counter-intuitive phenomenon in deep learning. Its mechanistic explanation of how non-semantic data transfers semantic traits has profound implications for AI safety, interpretability, and alignment, giving it a higher potential for broad, long-lasting scientific impact across the theoretical and applied AI communities.
Paper 1 presents a highly innovative approach to LLM-KG integration by abstracting KG schemas into Python classes and utilizing code generation for reasoning. This addresses critical bottlenecks of inflexibility and context-window scalability in traditional RAG systems. Its substantial performance gains (up to 10.5%) on standard benchmarks and the broad applicability of bridging LLMs, code execution, and structured data suggest a higher potential for real-world impact and methodological adoption compared to the specialized RL reward shaping in Paper 2.
Paper 1 (Code-on-Graph) addresses fundamental limitations of LLM-KG integration with a novel programmatic reasoning framework that demonstrates strong empirical results (up to 10.5% improvement) on established benchmarks. Its approach of representing KG schemas as Python classes for code-based reasoning is innovative and broadly applicable. Paper 2 tackles an important but narrower problem (instruction following constraints) with a graph-based approach. While useful, Paper 1 has greater breadth of impact, stronger methodological novelty in bridging code generation with KG reasoning, and addresses a more foundational challenge in the LLM ecosystem.
Paper 1 addresses a highly practical and timely problem—integrating LLMs with Knowledge Graphs—offering a novel programmatic reasoning framework (CoG) with strong empirical results (up to 10.5% improvement over SOTA). It has broad applicability across NLP, question answering, and AI systems. Paper 2, while theoretically rigorous in extending non-monotonic reasoning to defeasible standpoint logic, addresses a niche area in formal logic with a narrower audience and fewer immediate real-world applications. The timeliness and breadth of impact favor Paper 1 significantly.
Paper 1 demonstrates higher scientific impact due to its strong methodological rigor and concrete empirical results. While Paper 2 presents a timely theoretical architecture for edge AI, it explicitly lacks empirical benchmarks. In contrast, Paper 1 introduces a novel programmatic reasoning framework for LLM-KG integration that solves critical scalability bottlenecks. By validating its approach on standard datasets and achieving up to a 10.5% improvement over state-of-the-art models, Paper 1 offers proven, immediate utility and broad applicability in the highly active research area of LLM reasoning and retrieval-augmented generation.
Paper 2 (Code-on-Graph) addresses a fundamental challenge in LLM-KG integration with a novel programmatic reasoning framework that demonstrates significant performance improvements (up to 10.5%) on established benchmarks. It tackles the broadly impactful problem of LLM hallucination and knowledge limitations, which is highly timely given the widespread adoption of LLMs. Paper 1 (scTranslation) provides a valuable benchmark for single-cell multi-omics translation but is more incremental as a benchmarking study rather than introducing a fundamentally new method. Paper 2's broader applicability across AI/NLP gives it higher potential impact.
Paper 1 presents a novel programmatic reasoning framework (Code-on-Graph) for LLM-KG integration that addresses fundamental limitations of existing approaches with strong empirical results (up to 10.5% improvement over SOTA). It introduces innovative technical contributions—representing KG schemas as Python classes and using code generation for reasoning—with broad applicability across knowledge-intensive NLP tasks. Paper 2, while practically useful, is primarily an engineering contribution combining existing evaluation dimensions into a resource-efficient pipeline without significant methodological novelty. Paper 1 has greater potential to influence future research directions in knowledge-grounded reasoning.
Paper 1 is likely to have higher scientific impact due to broader relevance and novelty: it proposes a general LLM–knowledge graph integration paradigm (schema-to-code, executable reasoning) that addresses scalability and compositionality limits of prompt-injection retrieval, with strong gains across multiple standard KGQA benchmarks. This could influence LLM tool-use, neuro-symbolic reasoning, and retrieval-augmented systems beyond QA. Paper 2 is timely and practically valuable for multi-agent reliability, but is narrower (failure attribution on a specific benchmark) and more incremental in methodology (feature encoding + temporal/attention modeling), likely yielding more limited cross-field impact.