From Citation Selection to Citation Absorption: A Measurement Framework for Generative Engine Optimization Across AI Search Platforms

Zhang Kai, He Xinyue, Yao Jingang

Apr 28, 2026 · arXiv:2604.25707v2

cs.IR

#327 of 655 · cs.IR

Tournament Score: 1410 ± 32 (range shown: 1100 to 1750)

Win Rate: 33%

Rating: 4.5 / 10

Significance 5.5 · Rigor 4 · Novelty 5 · Clarity 6

Abstract

Generative search engines increasingly determine whether online information is merely discoverable, cited as a source, or actually absorbed into generated answers. This paper proposes a two-stage measurement framework for Generative Engine Optimization (GEO): citation selection, where a platform triggers search and chooses sources, and citation absorption, where a cited page contributes language, evidence, structure, or factual support to the final answer. We analyze the public geo-citation-lab dataset covering 602 controlled prompts across ChatGPT, Google AI Overview/Gemini, and Perplexity; 21,143 valid search-layer citations; 23,745 citation-level feature records; 18,151 successfully fetched pages; and 72 extracted features. The central descriptive finding is that citation breadth and citation depth diverge. Perplexity and Google cite more sources on average, while ChatGPT cites fewer sources but shows substantially higher average citation influence among fetched pages. High-influence pages tend to be longer, more structured, semantically aligned, and richer in extractable evidence such as definitions, numerical facts, comparisons, and procedural steps. The results suggest that GEO should be measured beyond citation counts, with answer-level absorption treated as a separate outcome.

AI Impact Assessments (3 models)

Scientific Impact Assessment

Core Contribution

This paper proposes a two-stage measurement framework for Generative Engine Optimization (GEO) that distinguishes between citation selection (whether a source is chosen by a generative search engine) and citation absorption (whether that source materially shapes the generated answer). Using a public dataset of 602 controlled prompts across ChatGPT, Google AI Overview/Gemini, and Perplexity, the authors document a central empirical finding: citation breadth and citation depth diverge sharply. ChatGPT cites fewer sources (mean 6.88) but shows substantially higher per-citation influence (0.2713), while Perplexity cites broadly (mean 16.35) with low per-citation influence (0.0646).
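As a quick check on the scale of that divergence, the reported means can be turned into ratios. The sketch below uses only the four figures quoted above. The variable names are illustrative, and Google is left out because its figures are not quoted here.

```python
# Means quoted in the assessment above (ChatGPT vs. Perplexity).
mean_citations = {"ChatGPT": 6.88, "Perplexity": 16.35}
mean_influence = {"ChatGPT": 0.2713, "Perplexity": 0.0646}

# Breadth: how many more sources Perplexity cites on average.
breadth_ratio = mean_citations["Perplexity"] / mean_citations["ChatGPT"]

# Depth: how much higher ChatGPT's mean per-citation influence is.
depth_ratio = mean_influence["ChatGPT"] / mean_influence["Perplexity"]

print(f"breadth ratio (Perplexity / ChatGPT): {breadth_ratio:.2f}")
print(f"depth ratio (ChatGPT / Perplexity): {depth_ratio:.2f}")
```

On these figures Perplexity cites roughly 2.4 times as many sources, while ChatGPT's mean influence per fetched citation is roughly 4.2 times higher.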

The conceptual contribution—separating selection from absorption—is intuitive but genuinely useful for an emerging field that has largely measured visibility through citation counts alone. The "evidence-container hypothesis" (pages that package modular, semantically aligned, extractable evidence are more deeply absorbed) provides a coherent organizing principle.

Methodological Rigor

The paper is unusually self-aware about its methodological limitations, which is both a strength and a weakness. The authors explicitly avoid causal claims, fabricated p-values, or regression coefficients, instead presenting a four-level identification hierarchy (direct counts → descriptive contrasts → mechanistic interpretations → causal prescriptions) and claiming only the first two levels as empirical findings. This epistemic discipline is commendable and rare in applied IR research.

However, this conservatism also means the paper delivers relatively little beyond descriptive statistics and mean comparisons. The influence_score is a hand-crafted weighted composite (0.20·ref_count + 0.15·position + 0.20·coverage + 0.25·TF-IDF + 0.20·n-gram overlap), and the authors correctly note its constructed nature but never validate it against ground truth. Without any human annotation of actual absorption, sentence-level attribution analysis, or comparison with model internals, the core dependent variable remains an untested proxy. The large ChatGPT-vs-others influence gap (0.27 vs ~0.06) could plausibly arise from systematic differences in answer length, citation rendering format, or HTML parsing artifacts rather than genuine absorption differences.
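To make the constructed nature of the score concrete, here is a minimal sketch of a weighted composite using the five weights quoted above. It assumes each component has already been normalized to the range 0 to 1, which the text here does not confirm. The function name, dictionary keys, and the example page are hypothetical and are not taken from the authors' code.

```python
# Weights as quoted in the assessment; they sum to 1.0.
WEIGHTS = {
    "ref_count": 0.20,
    "position": 0.15,
    "coverage": 0.20,
    "tfidf": 0.25,
    "ngram_overlap": 0.20,
}


def influence_score(components: dict) -> float:
    """Weighted sum of five components, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        value = components[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside [0, 1]")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical page: strong lexical overlap, cited late in the answer.
example_page = {
    "ref_count": 0.5,
    "position": 0.2,
    "coverage": 0.4,
    "tfidf": 0.6,
    "ngram_overlap": 0.3,
}
print(f"influence_score = {influence_score(example_page):.3f}")
```

Because every term is a surface-level text or position statistic, any platform difference in answer length or citation rendering would feed directly into the score, which is why validation against human judgments matters.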

The dataset design with four prompt layers (main, style, language, scenario) provides useful controlled variation, but the 602 prompts are researcher-designed rather than sampled from real user traffic, limiting external validity. The 76.44% fetch success rate introduces non-random missingness that is acknowledged but not modeled.
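The 76.44% figure can be reproduced from the counts in the abstract if the denominator is the 23,745 citation-level feature records. That choice of denominator is an inference from the arithmetic, not something stated in the text. A short check:

```python
# Counts taken from the abstract.
feature_records = 23_745   # citation-level feature records
fetched_pages = 18_151     # successfully fetched pages

fetch_rate = fetched_pages / feature_records
unfetched = feature_records - fetched_pages

print(f"fetch success rate: {fetch_rate:.2%}")
print(f"records without a fetched page: {unfetched}")
```

Under this reading, 5,594 records lack page-level features, so any absorption statistic is computed on the fetched subset only.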

The paper specifies confirmatory models (negative binomial for selection, fractional logit for absorption) but does not execute them, instead leaving them as a "transparent analysis plan." While intellectually honest, this means the paper essentially presents summary tables without any inferential statistics—no confidence intervals, no significance tests, no multivariate controls. This is a substantial gap for a paper analyzing 23,745 records.
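For illustration only, the sketch below shows one simple way an interval could be attached to a between-platform mean difference: a percentile bootstrap. It is not the authors' planned analysis (they specify negative binomial and fractional logit models), and the data are synthetic values generated for the example, not drawn from the dataset. The function name and all parameters are assumptions.

```python
import random
import statistics


def bootstrap_mean_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling each group."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resampled_a) - statistics.fmean(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic influence scores clipped to [0, 1]; NOT real data.
_rng = random.Random(42)
platform_a = [min(1.0, max(0.0, _rng.gauss(0.27, 0.10))) for _ in range(300)]
platform_b = [min(1.0, max(0.0, _rng.gauss(0.06, 0.05))) for _ in range(300)]

ci_low, ci_high = bootstrap_mean_diff_ci(platform_a, platform_b)
print(f"95% bootstrap CI for mean difference: [{ci_low:.3f}, {ci_high:.3f}]")
```

This version treats citations as independent draws. With real data, citations are nested within prompts, so resampling whole prompts rather than individual records would likely be the more defensible choice.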

Potential Impact

The selection-absorption distinction could become a standard conceptual framework for GEO research, similar to how impressions vs. clicks structured SEO measurement. The practical implications—design pages as "evidence containers" with definitions, statistics, comparisons, and modular structure rather than simply adopting Q&A formatting—are immediately actionable for publishers and content strategists.

The negative Q&A finding (Q&A pages show -5.74% relative influence difference) is genuinely useful and counter-intuitive, challenging a widespread industry heuristic. Similarly, the finding that news sources are frequently selected but weakly absorbed while encyclopedia pages show the opposite pattern provides meaningful strategic insight.

The cross-platform comparison fills a gap, as most GEO work has studied single platforms or used proprietary benchmarks. The public dataset and reproducibility checklist lower barriers for follow-up research.

However, the practical impact is limited by the lack of causal evidence. The paper repeatedly and correctly warns against converting correlations into optimization tactics, but this also means practitioners cannot confidently act on the findings.

Timeliness & Relevance

GEO is genuinely timely as generative search engines rapidly replace traditional search for many queries. The paper addresses a real measurement gap: the field needs frameworks beyond citation counting. The conceptual vocabulary (citation breadth vs. depth, evidence containers, selection vs. absorption) fills an emerging need.

The paper cites several 2025-2026 references, positioning itself within a fast-moving research front. The dataset covers three major platforms that collectively represent the generative search market.

Strengths

1. Conceptual clarity: The selection-absorption distinction is well-articulated and genuinely useful.

2. Epistemic discipline: The four-level identification hierarchy and claim-level self-audit set a high standard for GEO research, which is prone to marketing-driven overclaiming.

3. Counter-intuitive findings: The Q&A formatting result, the citation breadth-depth divergence, and the news selection-absorption split are all informative.

4. Public data and reproducibility infrastructure: The dataset, analysis plan, and detailed data dictionary support replication.

5. Cross-platform scope: Comparing three major platforms under controlled conditions is valuable.

Limitations

1. No validation of the influence score: The core dependent variable is an unvalidated proxy. The 4.6x gap between ChatGPT and other platforms could be an artifact.

2. Purely descriptive: No inferential statistics, no multivariate analysis, no confidence intervals despite ample data. The confirmatory models are specified but not executed.

3. Static snapshot: Single time-point data for rapidly evolving platforms limits shelf life.

4. No ground truth: No human annotation of actual absorption, no sentence-level attribution analysis.

5. Independent researchers without institutional affiliation: While not inherently problematic, the lack of peer review infrastructure and the self-referential dataset provenance (authors created and analyzed their own dataset) warrant scrutiny.

6. Verbose presentation: The paper is heavily padded with caveats, checklists, and repeated disclaimers that could be condensed, making the actual empirical contribution feel thin relative to the paper's length.

Overall Assessment

This paper introduces a useful conceptual framework and provides interesting descriptive findings for an emerging field. However, it stops short of delivering the analytical depth its dataset enables. It reads more as a well-documented exploratory data analysis with a thoughtful measurement proposal than as a complete empirical study. The contribution is primarily conceptual and descriptive, with the empirical work serving as illustration rather than rigorous evidence.

Rating: 4.5 / 10

Significance 5.5 · Rigor 4 · Novelty 5 · Clarity 6

Generated Apr 30, 2026

Comparison History (45)

Lost vs. DADF: A Distribution-Aware Debiasing Framework for Watch-Time Regression in Recommender Systems

Paper 2 (DADF) demonstrates higher scientific impact potential due to several factors: (1) it addresses a well-defined, broadly applicable problem (long-tailed regression bias) with a rigorous, reproducible framework validated on both public benchmarks and large-scale industrial systems; (2) it shows concrete, measurable improvements including online A/B test results, demonstrating real-world applicability; (3) the methodology is generalizable beyond recommender systems to other long-tailed regression tasks; (4) open-source code availability enhances reproducibility. Paper 1 introduces a useful measurement framework for GEO but is primarily descriptive, analyzes a single dataset, and addresses a narrower, more niche audience in SEO/information retrieval.

claude-opus-4-6 · May 19, 2026

Won vs. RAGR: Review-Augmented Generative Recommendation

Paper 1 addresses the highly timely and disruptive field of Generative Engine Optimization (GEO). While Paper 2 offers a solid methodological improvement for recommender systems, Paper 1 pioneers a novel measurement framework for how LLM-based search engines select and absorb web information. This has profound, cross-disciplinary implications for information retrieval, web economics, and NLP, giving it significantly broader real-world application and higher potential scientific impact than the narrower domain of sequential recommendation.

gemini-3.1-pro-preview · May 19, 2026

Lost vs. Towards Sustainable Growth: A Multi-Value-Aware Retrieval Framework for E-Commerce Search

Paper 2 has higher likely scientific impact: it introduces a novel, deployable retrieval framework that explicitly optimizes multi-stage business values and long-term growth, combining counterfactual LTV estimation with policy-optimized generative retrieval. It demonstrates strong methodological rigor via causal inference framing plus large-scale online A/B validation and reports measurable ecosystem-wide gains on a major platform (Taobao), indicating clear real-world applicability and timeliness for recommender/search systems. Paper 1 is timely and useful as a measurement framework for GEO, but is largely descriptive/observational and its broader cross-domain impact is less direct.

gpt-5.2 · May 19, 2026

Won vs. UniER: A Unified Benchmark for Item-level and Path-level Exercise Recommendation

Paper 2 addresses a rapidly emerging and broadly relevant topic—how generative AI search engines select and absorb information—which has implications across information science, SEO, digital marketing, AI fairness, and media studies. Its novel two-stage framework (citation selection vs. absorption) introduces a timely conceptual contribution as generative search becomes ubiquitous. Paper 1, while methodologically solid, addresses a narrower educational technology niche (exercise recommendation benchmarking) with more incremental contributions. Paper 2's timeliness and cross-disciplinary relevance give it higher potential impact.

claude-opus-4-6 · May 19, 2026

Lost vs. PIPER: Content-Based Table Search via profiling and LLM-Generated Pseudoqueries

Paper 2 (PIPER) is likely to have higher scientific impact: it introduces a novel, generalizable retrieval approach (profiling + LLM-generated pseudoqueries) addressing a widespread, well-defined problem (dataset/table search in poor-metadata data lakes). It offers clear real-world applicability to data management systems and can influence IR, databases, and enterprise search. Paper 1 provides a useful measurement framework for GEO and timely insights, but its impact is more domain-specific (AI search platform behavior/SEO-like optimization) and depends on rapidly changing proprietary systems, potentially limiting methodological stability and long-term generalization.

gpt-5.2 · May 19, 2026

Lost vs. Traditional statistical representations outperform generative AI in identifying expert peer reviewers

Paper 1 addresses a critical bottleneck in the scientific process—peer review—by providing rigorous empirical evidence that challenges the prevailing hype around LLMs. Its finding that traditional statistical methods outperform generative AI for specialized tasks has immediate, actionable implications for scientific infrastructure and publishing. While Paper 2 explores a novel area in AI search, Paper 1's impact is broader and more fundamental to the integrity and efficiency of the scientific community itself.

gemini-3.1-pro-preview · May 19, 2026

Lost vs. Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix

Paper 2 offers a clear algorithmic innovation (q-log odds IDF) with strong, quantified gains on a major code-retrieval benchmark, minimal deployment friction (drop-in, unchanged latency), and a principled parameterization linked to corpus statistics. Its methodological rigor is higher (bootstrap testing, ablations, cross-corpus validation) and the contribution is timely for retrieval-augmented coding. Paper 1 is valuable and timely for evaluating AI search/GEO, but it is primarily a measurement/descriptive framework with less generalizable methodological novelty and more platform/dataset dependence.

gpt-5.2 · May 19, 2026

Lost vs. Differentially Private Motif-Preserving Multi-modal Hashing

Paper 2 addresses a fundamental and growing problem at the intersection of differential privacy and graph-structured data, introducing a novel theoretical concept (Hubness Explosion) and a principled framework (DMP-MH) with strong formal guarantees. It demonstrates significant empirical improvements and has broader applicability beyond cross-modal hashing to any privacy-preserving graph learning task. Paper 1, while timely and practically relevant for SEO/GEO practitioners, is primarily descriptive and measurement-focused, proposing a framework rather than a technical solution, limiting its methodological depth and cross-field impact.

claude-opus-4-6 · May 18, 2026

Won vs. Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization

While Paper 1 offers significant systems-level optimizations for AI infrastructure, Paper 2 addresses a fundamental paradigm shift in information retrieval: Generative Engine Optimization (GEO). By providing a foundational measurement framework for how AI search engines cite and absorb content, Paper 2 will heavily influence web search, information retrieval, and the future of SEO, ensuring much broader multidisciplinary and societal impact.

gemini-3.1-pro-preview · May 18, 2026

Won vs. Generative Long-term User Interest Modeling for Click-Through Rate Prediction

Paper 1 introduces a novel measurement framework for an emerging and rapidly growing field (Generative Engine Optimization), addressing a timely problem as AI search engines reshape information discovery. Its two-stage framework (citation selection vs. absorption) provides foundational conceptual infrastructure for a new research area, with broad implications across information retrieval, SEO, digital marketing, and AI evaluation. Paper 2 is a solid but incremental contribution to CTR prediction, an already well-studied domain, offering improvements to existing two-stage frameworks. The novelty and timeliness of Paper 1's topic gives it higher potential impact.

claude-opus-4-6 · May 18, 2026
