Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering
Mateusz Czyżnikiewicz, Ryszard Tuora, Adam Kozakiewicz, Tomasz Ziętkiewicz, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Timothy Hospedales
Abstract
Retrieval-Augmented Generation (RAG) systems for question answering typically retrieve evidence by semantic similarity between the query and document chunks. While effective for unstructured text, this approach is less reliable on semi-structured corpora where answering may require exact filtering, aggregation, or exhaustive retrieval over structured attributes across multiple documents. Symbolic approaches support such operations, but they are often brittle on noisy natural-language corpora. We address this gap with DualGraph, a RAG framework that represents documents through two complementary views: a Textual Knowledge Graph for semantic retrieval and a Symbolic Knowledge Graph for symbolic querying over typed subject–predicate–object triples. Building on these two components, we provide multiple strategies for selecting or combining semantic and symbolic evidence. We also introduce SpecsQA, a benchmark from a commercial shopping website with semi-structured product documents and manually curated questions spanning open-ended and specification-oriented retrieval. Experiments show that DualGraph consistently outperforms state-of-the-art dense-retrieval, GraphRAG, symbolic, and table-oriented baselines across question types. Code and data are available at https://github.com/corneliocristina/DualGraphRAG.
AI Impact Assessments
(1 model)
Scientific Impact Assessment: "Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering"
1. Core Contribution
This paper addresses a genuine gap in RAG systems: the inability to handle semi-structured corpora where answering requires both semantic understanding and exact symbolic operations (filtering, aggregation, exhaustive enumeration). The proposed DualGraph framework constructs two complementary graph views from the same document corpus—a Textual Knowledge Graph (TKG) for semantic retrieval and a Symbolic Knowledge Graph (SKG) for formal SPARQL-based querying over typed triples. Multiple orchestration strategies (fallback, routing, concatenation, agentic) combine these views.
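To make the fallback strategy concrete, here is a minimal sketch of how a symbolic view and a semantic view might be combined. It is not the authors' implementation: the toy triples, the text chunks, the word-overlap scorer (a stand-in for embedding similarity), the plain-Python filter (a stand-in for a SPARQL query), and every function name are assumptions made for illustration. The rule used here, "fall back when the symbolic query returns nothing", is likewise an assumed criterion.

```python
import re
from typing import Callable, List, Optional, Tuple

Triple = Tuple[str, str, object]

# Hypothetical typed triples (subject, predicate, object) for the symbolic view.
TRIPLES: List[Triple] = [
    ("tv_a", "screen_size_inch", 55),
    ("tv_a", "refresh_rate_hz", 120),
    ("tv_b", "screen_size_inch", 65),
    ("tv_b", "refresh_rate_hz", 60),
]

# Hypothetical text chunks for the textual view.
CHUNKS = {
    "tv_a": "A 55 inch television suited to gaming with low input lag.",
    "tv_b": "A 65 inch television with a slim design for living rooms.",
}


def tokens(text: str) -> set:
    """Lowercased word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def symbolic_filter(predicate: str, test: Callable[[object], bool]) -> List[str]:
    """Exact, exhaustive filter over all triples (stand-in for a SPARQL query)."""
    return sorted({s for s, p, o in TRIPLES if p == predicate and test(o)})


def semantic_retrieve(question: str, k: int = 1) -> List[str]:
    """Rank chunks by word overlap (stand-in for embedding similarity)."""
    q = tokens(question)
    ranked = sorted(CHUNKS, key=lambda d: (-len(q & tokens(CHUNKS[d])), d))
    return ranked[:k]


def answer_with_fallback(
    question: str,
    predicate: Optional[str] = None,
    test: Optional[Callable[[object], bool]] = None,
) -> Tuple[str, List[str]]:
    """Try the symbolic view first; use the semantic view if it yields nothing."""
    if predicate is not None and test is not None:
        hits = symbolic_filter(predicate, test)
        if hits:
            return "symbolic", hits
    return "semantic", semantic_retrieve(question)


if __name__ == "__main__":
    print(answer_with_fallback("Which TVs are at least 60 inches?",
                               "screen_size_inch", lambda v: v >= 60))
    print(answer_with_fallback("Which television suits gaming with low lag?"))
```

The symbolic path returns every matching product, which is what exhaustive filtering questions need, while the semantic path returns only a top-k ranking. Routing, concatenation, and agentic strategies would replace the simple `if hits` test with a different decision rule.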
Additionally, the paper introduces SpecsQA, a benchmark of 117 manually curated questions over 2,162 Samsung UK product pages, spanning inverse queries, multi-condition filtering, group comparisons, and open-ended reasoning. The dataset includes both natural-language and canonical product-list answers, enabling both soft and deterministic evaluation.
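One way canonical product-list answers could support deterministic scoring is set-based precision, recall, and F1 against the gold list. The sketch below is an assumption about how such a metric might look, not the metric defined in the paper; the function name and the case-insensitive name normalization are illustrative choices.

```python
from typing import Dict, Iterable


def product_list_scores(predicted: Iterable[str], gold: Iterable[str]) -> Dict[str, float]:
    """Compare a predicted product list with the canonical list as sets."""
    # Normalize names so trivial casing or spacing differences do not count as errors.
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}

    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "exact_match": float(pred == ref),
    }


if __name__ == "__main__":
    scores = product_list_scores(
        ["Model A", "Model B", "Model C"],  # hypothetical system output
        ["model a", "model b"],             # hypothetical canonical answer
    )
    print(scores)
```

Recall matters most for the exhaustive-retrieval questions described above, since a top-k retriever can be precise yet still miss products that satisfy the filter.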
2. Methodological Rigor
Strengths in methodology:
Weaknesses in methodology:
3. Potential Impact
The paper addresses a practical and commercially relevant scenario: QA over product catalogs, enterprise documentation, and customer-support systems where structured specifications coexist with unstructured text. This is a common real-world setting where existing RAG systems demonstrably struggle.
The dual-graph architecture is conceptually clean and could inspire similar hybrid approaches in domains like healthcare records (structured lab values + clinical notes), legal documents (structured clauses + free text), or scientific databases. The general principle of maintaining parallel symbolic and semantic representations with fallback mechanisms is broadly applicable.
However, impact may be limited by:
4. Timeliness & Relevance
The paper is highly timely. RAG systems are being widely deployed, and their limitations on semi-structured data are increasingly apparent in industrial applications. The observation that neither pure semantic retrieval nor pure symbolic approaches suffice—and that their combination yields substantial gains—is a valuable empirical finding for the RAG community.
The paper also arrives at a moment when GraphRAG is gaining significant traction (Microsoft GraphRAG, HippoRAG, etc.), and the demonstrated superiority of DualGraph over these systems on specification-heavy queries provides useful guidance for practitioners.
5. Strengths & Limitations
Key Strengths:
Key Limitations:
Overall Assessment
This is a solid systems paper that identifies a real limitation in current RAG architectures, proposes a principled hybrid solution, and demonstrates its effectiveness through comprehensive experiments. The dual-graph concept is sound and the engineering is careful. However, the scientific contribution is somewhat incremental—combining symbolic and semantic retrieval is a well-known idea, and the specific implementation choices (manual ontology, heuristic alignment) limit generalizability. The benchmark, while useful, is too small to become a standard community resource. The paper's primary value lies in its empirical demonstration that hybrid retrieval substantially outperforms pure approaches on semi-structured QA, and in providing a concrete, reproducible implementation of this principle.
Generated May 27, 2026
Comparison History (16)
Paper 2 addresses a practical and broadly applicable problem in RAG systems with a concrete, reproducible framework (DualGraph) and publicly available code/data. It tackles the important gap between semantic and symbolic retrieval for semi-structured data, which has immediate real-world applications across many domains. Paper 1 proposes an interesting ethical pluralism framework but operates in a narrower niche with a relatively small custom benchmark (450 cases), and its real-world applicability to AI alignment remains more speculative. Paper 2's dual-view approach is more likely to be adopted and extended by the broader NLP/IR community.
Paper 2 proposes a novel methodological framework (DualGraph) that bridges a critical gap in Retrieval-Augmented Generation (RAG) by elegantly combining semantic retrieval with symbolic querying for semi-structured data. This addresses a major challenge in enterprise and commercial AI applications, offering broader real-world utility and methodological innovation compared to Paper 1, which primarily focuses on benchmarking existing models in educational scenarios.
Paper 2 introduces both a novel framework (DualGraph) and a new benchmark (SpecsQA) for semi-structured QA, addressing a practical and widely relevant gap in RAG systems. It provides open-source code and data, enabling reproducibility and adoption. The dual semantic-symbolic approach has broad applicability across domains with semi-structured data. Paper 1 addresses the important but narrower problem of multi-stakeholder LLM alignment with a decomposition method, which is more niche. Paper 2's concrete benchmark, broader applicability to the booming RAG field, and released resources give it higher potential impact.
Paper 1 has higher likely scientific impact due to stronger timeliness and broader relevance: evaluating long-term personalization and proactive behavior is a central bottleneck for deploying LLM agents across many domains (assistants, healthcare, education, productivity). Its benchmark design (temporally ordered interactions, fragmented preference signals, proactiveness tests) and extensible memory interface can standardize evaluation and drive model/system research. Paper 2 is methodologically solid and highly applicable to e-commerce/spec QA, but its impact is more domain-specific and aligns with an active, already-crowded RAG/structured QA direction.
NeurIPS introduces a fundamentally novel approach to brain decoding by reframing anatomical variation as an inductive prior rather than a nuisance variable, achieving dramatic efficiency gains (10 vs 600 epochs) and strong scalability. This has broader implications for neuroscience, BCI applications, and geometric deep learning. Paper 1, while solid engineering combining semantic and symbolic retrieval for semi-structured QA, represents more incremental progress in the well-explored RAG space. Paper 2's cross-disciplinary impact (neuroscience + ML), principled methodology with causal ablations, and potential for clinical applications give it higher impact potential.
Paper 1 offers higher potential scientific impact because it addresses a critical bottleneck in medical AI: the lack of verifiable and interpretable reasoning in LLMs. By introducing a neuro-symbolic framework combining fuzzy logic with LLMs, it provides a high-stakes real-world application (clinical diagnosis) with rigorous, auditable inference paths. While Paper 2 presents a valuable methodological improvement for RAG systems, Paper 1 tackles a deeply impactful, life-critical domain where solving the transparency and hallucination problems of LLMs can fundamentally transform clinical decision-making.
While Paper 1 introduces a valuable synthetic benchmark for the niche but growing field of LLM-based GUI agents, Paper 2 tackles a more ubiquitous problem: RAG over semi-structured data. Because RAG is universally adopted across enterprise and academic applications, DualGraph's novel combination of textual and symbolic knowledge graphs offers immediate, high-impact improvements to a much broader range of real-world systems. Furthermore, providing both a robust methodology and a new dataset (SpecsQA) ensures strong methodological rigor and widespread utility across multiple NLP and information retrieval domains.
Paper 1 likely has higher scientific impact due to a clearer methodological contribution (DualGraph combining symbolic and semantic retrieval) plus a new real-world benchmark (SpecsQA) that can standardize evaluation for semi-structured QA. The dataset and code release enable broad adoption and reproducibility, and the problem (RAG failures on semi-structured/product-like corpora) is widely relevant across search, e-commerce, enterprise QA, and knowledge-intensive NLP. Paper 2 is timely for agent research, but its impact may be narrower and harder to validate long-term without widely adopted benchmarks/artifacts beyond reported gains.
Paper 2 has higher estimated impact due to broader applicability and timeliness: improving RAG for semi-structured corpora is a widely shared bottleneck across e-commerce, enterprise search, and technical QA. It contributes both a method (DualGraph combining semantic retrieval with symbolic querying) and a new benchmark (SpecsQA), which can catalyze follow-on work and standardize evaluation. Methodologically, the dual-view design and comparisons across diverse baselines suggest solid rigor. Paper 1 is novel and valuable for biomedical hypothesis contextualization, but its impact is more domain-specific and may generalize less broadly.
Paper 1 addresses a fundamental limitation in evaluating Theory of Mind (ToM) in LLMs by shifting the paradigm from simple end-point QA to explicit belief representation tracking. This offers profound theoretical insights into LLM reasoning capabilities and cognitive modeling. While Paper 2 presents a highly practical and relevant RAG framework for semi-structured data, Paper 1's focus on deep cognitive evaluation has broader implications for understanding and developing AGI, granting it higher potential for long-term scientific impact across AI and cognitive science.
Paper 2 introduces both a novel framework (DualGraph) and a new benchmark dataset (SpecsQA) addressing an underexplored gap in RAG systems for semi-structured data. The combination of semantic and symbolic retrieval is innovative and broadly applicable across NLP, information retrieval, and e-commerce. Paper 1, while solid, presents incremental improvements to ECG classification with known techniques (SE-ResNet, MixStyle) and acknowledges significant limitations in cross-domain generalization. Paper 2's open-source code/data and broader applicability across fields give it higher potential impact.
Paper 2 likely has higher scientific impact due to broader applicability and timeliness: improving RAG on semi-structured data is central to many production QA/search systems. DualGraph’s dual representation (semantic + symbolic KG) addresses a well-known limitation of pure semantic retrieval, enabling exact filtering/aggregation and better evidence completeness. The introduction of SpecsQA, sourced from real commercial product specs, increases practical relevance and potential for benchmarking. While Paper 1 advances CLIP fine-tuning for long captions, its impact is more specialized within vision-language alignment.
Paper 2 addresses a more fundamental and broadly applicable challenge—personalized memory for LLM agents—which impacts the rapidly growing field of long-horizon AI agents across many domains. It introduces both a novel benchmark and a new framework for an underexplored problem (personalized memory policies), with implications for virtually all LLM-based applications. Paper 1, while solid, addresses a more niche problem (semi-structured QA with dual graph RAG) with narrower applicability. Paper 2's timeliness and breadth of potential impact across the booming agent ecosystem give it an edge.
MemFail addresses a more fundamental and broadly applicable problem—understanding failure modes of LLM memory systems, which is critical as LLM agents become ubiquitous. Its diagnostic framework formalizing memory operations (summarization, storage, retrieval) and identifying failure modes provides a reusable methodology applicable across many domains. Paper 1, while solid, addresses a more niche problem (semi-structured QA) with an incremental engineering contribution. Paper 2's potential to influence the design of future LLM memory architectures gives it broader and more lasting impact.
Paper 1 likely has higher impact due to broader scope and timeliness: it proposes a general reinforcement-learning optimization framework for LLM-based multi-agent workflows (a rapidly growing paradigm), with abstractions for role-specific credit assignment and parameter sharing that can apply across many agentic applications. Its contributions are methodological and infrastructural, enabling reusable post-training across domains, which increases breadth of impact. Paper 2 is strong and practical (new benchmark + hybrid symbolic/semantic RAG for semi-structured QA) but is more domain-specific, so its impact is likely narrower despite solid applications.
Paper 2 likely has higher impact due to timeliness and broad applicability: improving RAG on semi-structured data is a widely felt bottleneck in real-world QA (e-commerce, enterprise, biomedical). It contributes both a method (DualGraph combining semantic and symbolic evidence) and a new benchmark (SpecsQA) with code/data, increasing adoption and follow-on work. Methodology is empirically validated against strong baselines. Paper 1 is novel and rigorous but targets a more specialized niche (WFOMC-based combinatorial counting), likely yielding narrower uptake despite strong theoretical contribution.