Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering

Mateusz Czyżnikiewicz, Ryszard Tuora, Adam Kozakiewicz, Tomasz Ziętkiewicz, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Timothy Hospedales

May 26, 2026

arXiv:2605.27164v1

cs.AI (primary)

#1339 of 2682 · Artificial Intelligence

Tournament Score

1411±43

56%

Win Rate

Rating

5.8 / 10

Significance: 5.5

Rigor: 6.5

Novelty: 5

Clarity: 7.5

Abstract

Retrieval-Augmented Generation (RAG) systems for question answering typically retrieve evidence by semantic similarity between the query and document chunks. While effective for unstructured text, this approach is less reliable on semi-structured corpora where answering may require exact filtering, aggregation, or exhaustive retrieval over structured attributes across multiple documents. Symbolic approaches support such operations, but they are often brittle on noisy natural-language corpora. We address this gap with DualGraph, a RAG framework that represents documents through two complementary views: a Textual Knowledge Graph for semantic retrieval and a Symbolic Knowledge Graph for symbolic querying over typed subject-predicate-object triples. Building on these two components, we provide multiple strategies for selecting or combining semantic and symbolic evidence. We also introduce SpecsQA, a benchmark from a commercial shopping website with semi-structured product documents and manually curated questions spanning open-ended and specification-oriented retrieval. Experiments show that DualGraph consistently outperforms state-of-the-art dense-retrieval, GraphRAG, symbolic, and table-oriented baselines across question types. Code and data are available at https://github.com/corneliocristina/DualGraphRAG.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering"

1. Core Contribution

This paper addresses a genuine gap in RAG systems: the inability to handle semi-structured corpora where answering requires both semantic understanding and exact symbolic operations (filtering, aggregation, exhaustive enumeration). The proposed DualGraph framework constructs two complementary graph views from the same document corpus—a Textual Knowledge Graph (TKG) for semantic retrieval and a Symbolic Knowledge Graph (SKG) for formal SPARQL-based querying over typed triples. Multiple orchestration strategies (fallback, routing, concatenation, agentic) combine these views.
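To make the fallback strategy concrete, here is a minimal sketch of one plausible reading: query the symbolic view first and use semantic retrieval only when the symbolic side returns nothing or cannot build a query. The function names, the `ValueError` convention, and the toy backends are all hypothetical. The paper's actual orchestration logic is not described in this assessment and may differ.

```python
from typing import Callable, Dict, List


def answer_with_fallback(
    question: str,
    symbolic_query: Callable[[str], List[str]],
    semantic_retrieve: Callable[[str], List[str]],
) -> Dict[str, object]:
    """Try the symbolic view first; fall back to semantic retrieval if it yields nothing."""
    try:
        rows = symbolic_query(question)
    except ValueError:
        # Assumed convention: the symbolic backend raises ValueError
        # when it cannot generate a valid query for the question.
        rows = []
    if rows:
        return {"source": "symbolic", "evidence": rows}
    return {"source": "semantic", "evidence": semantic_retrieve(question)}


# Toy stand-ins for the two views (hypothetical, for illustration only).
def toy_symbolic(question: str) -> List[str]:
    if "HDMI" in question:
        return ["TV-A", "TV-C"]
    raise ValueError("no query could be generated")


def toy_semantic(question: str) -> List[str]:
    return ["text chunk about picture quality"]


spec_answer = answer_with_fallback("Which TVs have 4 HDMI ports?", toy_symbolic, toy_semantic)
open_answer = answer_with_fallback("Which TV is best for films?", toy_symbolic, toy_semantic)
```

Routing would instead pick one view up front, and concatenation would pass evidence from both views to the generator. The sketch covers only the fallback case.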

Additionally, the paper introduces SpecsQA, a benchmark of 117 manually curated questions over 2,162 Samsung UK product pages, spanning inverse queries, multi-condition filtering, group comparisons, and open-ended reasoning. The dataset includes both natural-language and canonical product-list answers, enabling both soft and deterministic evaluation.
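A small sketch may help show why multi-condition filtering questions favour symbolic querying: the answer must list every product that satisfies all conditions, which top-k similarity search cannot guarantee. The triples, predicate names, and product names below are invented for illustration and are not taken from SpecsQA. A plain Python filter stands in for the SPARQL queries the paper uses.

```python
from collections import defaultdict

# Toy subject-predicate-object triples (hypothetical data).
TRIPLES = [
    ("TV-A", "screen_size_in", 55),
    ("TV-A", "hdmi_ports", 4),
    ("TV-B", "screen_size_in", 65),
    ("TV-B", "hdmi_ports", 2),
    ("TV-C", "screen_size_in", 75),
    ("TV-C", "hdmi_ports", 4),
]


def filter_products(triples, conditions):
    """Return every subject whose attributes satisfy all (predicate, test) pairs."""
    attrs = defaultdict(dict)
    for subject, predicate, obj in triples:
        attrs[subject][predicate] = obj
    return sorted(
        subject
        for subject, values in attrs.items()
        if all(pred in values and test(values[pred]) for pred, test in conditions.items())
    )


# "Which TVs are at least 60 inches and have at least 4 HDMI ports?"
matches = filter_products(
    TRIPLES,
    {"screen_size_in": lambda v: v >= 60, "hdmi_ports": lambda v: v >= 4},
)
```

Because the filter scans every subject, the result is exhaustive by construction. The same property is what makes a canonical product-list answer possible to score deterministically.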

2. Methodological Rigor

Strengths in methodology:

The experimental design is thorough: 11 baselines spanning dense retrieval, GraphRAG variants, symbolic, table-oriented, and agentic systems are compared. Results are averaged over 5 indexing runs, 3 query-generation runs, and 3 evaluation runs, which helps address stochastic variability.

Four complementary metrics (factual correctness, list matching F1, pairwise LLM-as-a-judge, and token cost) provide a well-rounded evaluation.
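For readers unfamiliar with list matching, the following sketch shows a standard set-based F1 between a predicted product list and a gold list. This is an assumed definition. The paper's exact matching rules (normalisation of product names, handling of duplicates or empty lists) are not given in this assessment.

```python
def list_matching_f1(predicted, gold):
    """Set-based F1 between predicted and gold item lists (assumed definition)."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # Both empty: treated here as a perfect match.
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Two of three predictions are correct; two of four gold items are found.
score = list_matching_f1(["A", "B", "C"], ["B", "C", "D", "E"])
```

In this example precision is 2/3 and recall is 1/2, giving an F1 of 4/7. An answer that omits half of the matching products is penalised even if everything it lists is correct.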

The ablation studies are comprehensive: orchestration strategy ablation (Table 3), pattern ablation for SPARQL generation (Table 4), category-level breakdowns, and objective/subjective splits all provide diagnostic value.

Pareto-front analysis of quality vs. computational cost (Figures 8-10) is informative for practical deployment considerations.
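As a reminder of what a Pareto front means in this setting, the sketch below keeps only the systems for which no other system is at least as good on quality and at least as cheap in tokens, with a strict improvement on one of the two. The system names and numbers are placeholders, not results from the paper.

```python
def pareto_front(systems):
    """Return names of systems not dominated on (higher quality, lower cost).

    `systems` is a list of (name, quality, cost) tuples.
    """
    front = []
    for name, quality, cost in systems:
        dominated = any(
            other_quality >= quality
            and other_cost <= cost
            and (other_quality > quality or other_cost < cost)
            for _, other_quality, other_cost in systems
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values: (name, quality score, tokens per question).
toy_systems = [
    ("sys-A", 0.30, 1000),
    ("sys-B", 0.25, 400),
    ("sys-C", 0.20, 800),
    ("sys-D", 0.37, 5000),
]
front = pareto_front(toy_systems)
```

Here sys-C drops out because sys-B is both better and cheaper. The remaining three each represent a different trade-off between quality and cost.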

Weaknesses in methodology:

The benchmark contains only 117 questions, which is quite small. Statistical significance is not formally tested, and with this sample size, differences between methods may not be robust. The per-category splits (as few as 24 questions for group comparison) further reduce statistical power.
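A rough calculation illustrates the concern about sample size. The sketch treats a score as if it were a simple proportion over independent questions, which is a crude simplification: F1 and factual-correctness scores are not proportions, and the paper may report variance differently. The value 0.37 is used only as an illustrative operating point.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a proportion p estimated from n independent items."""
    return math.sqrt(p * (1 - p) / n)


def approx_95_ci_halfwidth(p, n):
    """Half-width of a normal-approximation 95% confidence interval."""
    return 1.96 * proportion_standard_error(p, n)


full_benchmark = approx_95_ci_halfwidth(0.37, 117)   # all 117 questions
smallest_category = approx_95_ci_halfwidth(0.37, 24)  # a 24-question category
```

Under these assumptions the interval is roughly ±0.09 on the full benchmark and roughly ±0.19 on a 24-question category. Differences between systems smaller than that would be hard to distinguish from noise without a formal test.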

The SKG construction relies on a manually designed ontology and hand-crafted Datalog rules. While the authors acknowledge this, it significantly limits generalizability claims. The schema is tailored to product specifications, and extending to other domains requires non-trivial expert effort.

All experiments use a single LLM (GPT-OSS-120B) and a single embedding model. Generalization across model families is untested.

The LLM-as-a-judge metric shows clear verbosity bias (Figure 6, acknowledged by authors), which partially undermines the agentic variant's seemingly strong LaaJ performance.

Absolute scores across all systems are quite low (best list-matching F1 ~0.37, best factual correctness ~0.31), suggesting the benchmark may be extremely challenging, but also raising questions about whether any method is practically useful yet.

3. Potential Impact

The paper addresses a practical and commercially relevant scenario: QA over product catalogs, enterprise documentation, and customer-support systems where structured specifications coexist with unstructured text. This is a common real-world setting where existing RAG systems demonstrably struggle.

The dual-graph architecture is conceptually clean and could inspire similar hybrid approaches in domains like healthcare records (structured lab values + clinical notes), legal documents (structured clauses + free text), or scientific databases. The general principle of maintaining parallel symbolic and semantic representations with fallback mechanisms is broadly applicable.

However, impact may be limited by:

The domain-specific engineering required for the SKG (ontology, Datalog rules, SPARQL pattern design), which reduces out-of-the-box applicability.

The relatively narrow domain of the benchmark (single e-commerce website, single product ecosystem).

The small benchmark size limiting its adoption as a community standard.

4. Timeliness & Relevance

The paper is highly timely. RAG systems are being widely deployed, and their limitations on semi-structured data are increasingly apparent in industrial applications. The observation that neither pure semantic retrieval nor pure symbolic approaches suffice—and that their combination yields substantial gains—is a valuable empirical finding for the RAG community.

The paper also arrives at a moment when GraphRAG is gaining significant traction (Microsoft GraphRAG, HippoRAG, etc.), and the demonstrated superiority of DualGraph over these systems on specification-heavy queries provides useful guidance for practitioners.

5. Strengths & Limitations

Key Strengths:

Clean conceptual framework: The TKG/SKG duality is intuitive, well-motivated, and clearly presented.

Comprehensive baselines: Comparison against 11 diverse systems across multiple paradigms is commendable.

Practical relevance: The e-commerce QA scenario is commercially important and underserved by existing benchmarks.

Diagnostic value: Category-level and objective/subjective breakdowns reveal when and why different approaches succeed or fail.

Cost-aware evaluation: Reporting token usage alongside quality metrics is valuable for real-world deployment decisions.

Code and data release: Open availability enables reproducibility.

Key Limitations:

Small benchmark: 117 questions is insufficient for a community benchmark. Comparison datasets like HotpotQA (113K) or even domain-specific ones like FinQA (8.3K) are orders of magnitude larger.

Manual engineering overhead: The SKG pipeline requires domain expertise (ontology design, Datalog rules, pattern templates), limiting scalability to new domains.

Single-domain evaluation: All experiments use Samsung product pages. Cross-domain generalization is entirely untested.

Learned alignment underperforms: The contrastive alignment between TKG and SKG provided no measurable improvement over simple heuristics, suggesting the integration between the two views remains shallow.

Agentic SPARQL refinement failed: The iterative SPARQL refinement approach didn't help, suggesting fundamental challenges in NL-to-SPARQL translation that remain unresolved.

Low absolute performance: Even the best system achieves modest scores, leaving substantial room for improvement and raising questions about practical utility.

Overall Assessment

This is a solid systems paper that identifies a real limitation in current RAG architectures, proposes a principled hybrid solution, and demonstrates its effectiveness through comprehensive experiments. The dual-graph concept is sound and the engineering is careful. However, the scientific contribution is somewhat incremental—combining symbolic and semantic retrieval is a well-known idea, and the specific implementation choices (manual ontology, heuristic alignment) limit generalizability. The benchmark, while useful, is too small to become a standard community resource. The paper's primary value lies in its empirical demonstration that hybrid retrieval substantially outperforms pure approaches on semi-structured QA, and in providing a concrete, reproducible implementation of this principle.

Rating: 5.8 / 10

Significance 5.5 · Rigor 6.5 · Novelty 5 · Clarity 7.5

Generated May 27, 2026

Comparison History (16)

vs. Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a practical and broadly applicable problem in RAG systems with a concrete, reproducible framework (DualGraph) and publicly available code/data. It tackles the important gap between semantic and symbolic retrieval for semi-structured data, which has immediate real-world applications across many domains. Paper 1 proposes an interesting ethical pluralism framework but operates in a narrower niche with a relatively small custom benchmark (450 cases), and its real-world applicability to AI alignment remains more speculative. Paper 2's dual-view approach is more likely to be adopted and extended by the broader NLP/IR community.

vs. LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

gemini-3.1 · 5/27/2026

Paper 2 proposes a novel methodological framework (DualGraph) that bridges a critical gap in Retrieval-Augmented Generation (RAG) by elegantly combining semantic retrieval with symbolic querying for semi-structured data. This addresses a major challenge in enterprise and commercial AI applications, offering broader real-world utility and methodological innovation compared to Paper 1, which primarily focuses on benchmarking existing models in educational scenarios.

vs. Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation

claude-opus-4.6 · 5/27/2026

Paper 2 introduces both a novel framework (DualGraph) and a new benchmark (SpecsQA) for semi-structured QA, addressing a practical and widely relevant gap in RAG systems. It provides open-source code and data, enabling reproducibility and adoption. The dual semantic-symbolic approach has broad applicability across domains with semi-structured data. Paper 1 addresses the important but narrower problem of multi-stakeholder LLM alignment with a decomposition method, which is more niche. Paper 2's concrete benchmark, broader applicability to the booming RAG field, and released resources give it higher potential impact.

vs. VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

gpt-5.2 · 5/27/2026

Paper 1 has higher likely scientific impact due to stronger timeliness and broader relevance: evaluating long-term personalization and proactive behavior is a central bottleneck for deploying LLM agents across many domains (assistants, healthcare, education, productivity). Its benchmark design (temporally ordered interactions, fragmented preference signals, proactiveness tests) and extensible memory interface can standardize evaluation and drive model/system research. Paper 2 is methodologically solid and highly applicable to e-commerce/spec QA, but its impact is more domain-specific and aligns with an active, already-crowded RAG/structured QA direction.

vs. NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding

claude-opus-4.6 · 5/27/2026

NeurIPS introduces a fundamentally novel approach to brain decoding by reframing anatomical variation as an inductive prior rather than a nuisance variable, achieving dramatic efficiency gains (10 vs 600 epochs) and strong scalability. This has broader implications for neuroscience, BCI applications, and geometric deep learning. Paper 1, while solid engineering combining semantic and symbolic retrieval for semi-structured QA, represents more incremental progress in the well-explored RAG space. Paper 2's cross-disciplinary impact (neuroscience + ML), principled methodology with causal ablations, and potential for clinical applications give it higher impact potential.

vs. Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

gemini-3.1 · 5/27/2026

Paper 1 offers higher potential scientific impact because it addresses a critical bottleneck in medical AI: the lack of verifiable and interpretable reasoning in LLMs. By introducing a neuro-symbolic framework combining fuzzy logic with LLMs, it provides a high-stakes real-world application (clinical diagnosis) with rigorous, auditable inference paths. While Paper 2 presents a valuable methodological improvement for RAG systems, Paper 1 tackles a deeply impactful, life-critical domain where solving the transparency and hallucination problems of LLMs can fundamentally transform clinical decision-making.

vs. SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

gemini-3.1 · 5/27/2026

While Paper 1 introduces a valuable synthetic benchmark for the niche but growing field of LLM-based GUI agents, Paper 2 tackles a more ubiquitous problem: RAG over semi-structured data. Because RAG is universally adopted across enterprise and academic applications, DualGraph's novel combination of textual and symbolic knowledge graphs offers immediate, high-impact improvements to a much broader range of real-world systems. Furthermore, providing both a robust methodology and a new dataset (SpecsQA) ensures strong methodological rigor and widespread utility across multiple NLP and information retrieval domains.

vs. VeriTrace: Evolving Mental Models for Deep Research Agents

gpt-5.2 · 5/27/2026

Paper 1 likely has higher scientific impact due to a clearer methodological contribution (DualGraph combining symbolic and semantic retrieval) plus a new real-world benchmark (SpecsQA) that can standardize evaluation for semi-structured QA. The dataset and code release enable broad adoption and reproducibility, and the problem (RAG failures on semi-structured/product-like corpora) is widely relevant across search, e-commerce, enterprise QA, and knowledge-intensive NLP. Paper 2 is timely for agent research, but its impact may be narrower and harder to validate long-term without widely adopted benchmarks/artifacts beyond reported gains.

vs. Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?

gpt-5.2 · 5/27/2026

Paper 2 has higher estimated impact due to broader applicability and timeliness: improving RAG for semi-structured corpora is a widely shared bottleneck across e-commerce, enterprise search, and technical QA. It contributes both a method (DualGraph combining semantic retrieval with symbolic querying) and a new benchmark (SpecsQA), which can catalyze follow-on work and standardize evaluation. Methodologically, the dual-view design and comparisons across diverse baselines suggest solid rigor. Paper 1 is novel and valuable for biomedical hypothesis contextualization, but its impact is more domain-specific and may generalize less broadly.

vs. OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

gemini-3.1 · 5/27/2026

Paper 1 addresses a fundamental limitation in evaluating Theory of Mind (ToM) in LLMs by shifting the paradigm from simple end-point QA to explicit belief representation tracking. This offers profound theoretical insights into LLM reasoning capabilities and cognitive modeling. While Paper 2 presents a highly practical and relevant RAG framework for semi-structured data, Paper 1's focus on deep cognitive evaluation has broader implications for understanding and developing AGI, granting it higher potential for long-term scientific impact across AI and cognitive science.

vs. HeartBeatAI: An Interpretable and Robust Deep Learning Framework for Multi-Label ECG Arrhythmia Detection

claude-opus-4.6 · 5/27/2026

Paper 2 introduces both a novel framework (DualGraph) and a new benchmark dataset (SpecsQA) addressing an underexplored gap in RAG systems for semi-structured data. The combination of semantic and symbolic retrieval is innovative and broadly applicable across NLP, information retrieval, and e-commerce. Paper 1, while solid, presents incremental improvements to ECG classification with known techniques (SE-ResNet, MixStyle) and acknowledges significant limitations in cross-domain generalization. Paper 2's open-source code/data and broader applicability across fields give it higher potential impact.

vs. FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact due to broader applicability and timeliness: improving RAG on semi-structured data is central to many production QA/search systems. DualGraph’s dual representation (semantic + symbolic KG) addresses a well-known limitation of pure semantic retrieval, enabling exact filtering/aggregation and better evidence completeness. The introduction of SpecsQA, sourced from real commercial product specs, increases practical relevance and potential for benchmarking. While Paper 1 advances CLIP fine-tuning for long captions, its impact is more specialized within vision-language alignment.

vs. Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents

claude-opus-4.6 · 5/27/2026

Paper 2 addresses a more fundamental and broadly applicable challenge—personalized memory for LLM agents—which impacts the rapidly growing field of long-horizon AI agents across many domains. It introduces both a novel benchmark and a new framework for an underexplored problem (personalized memory policies), with implications for virtually all LLM-based applications. Paper 1, while solid, addresses a more niche problem (semi-structured QA with dual graph RAG) with narrower applicability. Paper 2's timeliness and breadth of potential impact across the booming agent ecosystem give it an edge.

vs. MemFail: Stress-Testing Failure Modes of LLM Memory Systems

claude-opus-4.6 · 5/27/2026

MemFail addresses a more fundamental and broadly applicable problem—understanding failure modes of LLM memory systems, which is critical as LLM agents become ubiquitous. Its diagnostic framework formalizing memory operations (summarization, storage, retrieval) and identifying failure modes provides a reusable methodology applicable across many domains. Paper 1, while solid, addresses a more niche problem (semi-structured QA) with an incremental engineering contribution. Paper 2's potential to influence the design of future LLM memory architectures gives it broader and more lasting impact.

vs. UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

gpt-5.2 · 5/27/2026

Paper 1 likely has higher impact due to broader scope and timeliness: it proposes a general reinforcement-learning optimization framework for LLM-based multi-agent workflows (a rapidly growing paradigm), with abstractions for role-specific credit assignment and parameter sharing that can apply across many agentic applications. Its contributions are methodological and infrastructural, enabling reusable post-training across domains, which increases breadth of impact. Paper 2 is strong and practical (new benchmark + hybrid symbolic/semantic RAG for semi-structured QA) but is more domain-specific, so its impact is likely narrower despite solid applications.

vs. Solving Combinatorial Counting Problems with Weighted First-Order Model Counting

gpt-5.2 · 5/27/2026

Paper 2 likely has higher impact due to timeliness and broad applicability: improving RAG on semi-structured data is a widely felt bottleneck in real-world QA (e-commerce, enterprise, biomedical). It contributes both a method (DualGraph combining semantic and symbolic evidence) and a new benchmark (SpecsQA) with code/data, increasing adoption and follow-on work. Methodology is empirically validated against strong baselines. Paper 1 is novel and rigorous but targets a more specialized niche (WFOMC-based combinatorial counting), likely yielding narrower uptake despite strong theoretical contribution.