Don't Measure Once: Measuring Visibility in AI Search (GEO)

Julius Schulte, Malte Bleeker, Philipp Kaufmann

Apr 8, 2026

arXiv:2604.07585v1

cs.IR (primary), cs.AI

#266 of 506 · cs.IR

Tournament Score: 1410 ± 21 (scale 1100 to 1750)

Win Rate: 43%

Rating: 4.8 / 10

Significance 5.5 · Rigor 5 · Novelty 3.5 · Clarity 7

Abstract

As large language model-based chat systems become increasingly widely used, generative engine optimization (GEO) has emerged as an important problem for information access and retrieval. In classical search engines, results are comparatively transparent and stable: a single query often provides a representative snapshot of where a page or brand appears relative to competitors. The inherent probabilistic nature of AI search changes this paradigm. Answers can vary across runs, prompts, and time, making one-off observations unreliable. Drawing on empirical studies, our findings underscore the need for repeated measurements to assess a brand's GEO performance and to characterize visibility as a distribution rather than a single-point outcome.

AI Impact Assessments

(3 models)

Scientific Impact Assessment

Core Contribution

This paper addresses the measurement methodology for brand visibility in AI-powered search engines (ChatGPT, Gemini, Google AI Mode, Perplexity), arguing that the stochastic nature of LLM-generated responses makes single-point measurements unreliable. The key insight is straightforward: unlike traditional search engine rankings that are largely deterministic for a given query, generative search engines produce variable outputs across runs, prompts, and time. The authors propose treating GEO visibility as a distribution rather than a point estimate, and provide empirical guidelines for minimum measurement requirements (≥7 runs per prompt per day for brand monitoring; 2-4 week rolling windows for temporal tracking).
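To make the "distribution rather than point estimate" idea concrete, the sketch below summarizes repeated runs of one prompt as a visibility rate with an uncertainty interval. The Wilson score interval is one common choice for a binomial proportion and is an assumption here, not something taken from the paper; the run data and function names are hypothetical.

```python
import math

def visibility_rate(appearances):
    """Share of runs in which the brand appeared (list of booleans)."""
    return sum(appearances) / len(appearances)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n == 0:
        raise ValueError("need at least one run")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical outcome of 7 runs of one prompt on one day.
runs = [True, False, True, True, False, True, True]
rate = visibility_rate(runs)
low, high = wilson_interval(sum(runs), len(runs))

# A single observation leaves the interval very wide.
single_low, single_high = wilson_interval(1, 1)
```

With a single run in which the brand appears, the interval still spans most of the unit range, which is the practical argument against one-off measurements.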

The paper introduces two complementary dimensions—visibility (whether a brand appears) and stability (how consistently it appears)—and measures both using Jaccard similarity and Rank-Biased Overlap (RBO) across temporal and simultaneous-run datasets.
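As a rough illustration of the two similarity measures named above, the following sketch computes Jaccard similarity on source sets and a truncated form of Rank-Biased Overlap on ranked lists. The persistence parameter `p=0.9`, the handling of empty sets, and the handling of lists of unequal length are assumptions for illustration; the paper's own settings and edge-case policy may differ.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty result sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)

def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated RBO: (1 - p) * sum over depths d of p^(d-1) * overlap_d / d.

    No extrapolation beyond the observed depth is applied, so two identical
    lists of length k score 1 - p**k rather than 1.
    """
    k = max(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, k + 1):
        if d <= len(ranking_a):
            seen_a.add(ranking_a[d - 1])
        if d <= len(ranking_b):
            seen_b.add(ranking_b[d - 1])
        score += p ** (d - 1) * len(seen_a & seen_b) / d
    return (1 - p) * score

# Hypothetical cited sources on two consecutive days.
day1 = ["a.example", "b.example", "c.example"]
day2 = ["b.example", "c.example", "d.example"]
overlap = jaccard(day1, day2)
rank_sim = rbo_truncated(day1, day2)
```

Jaccard ignores order, while RBO weights agreement near the top of the list more heavily, which is why the two are complementary.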

Methodological Rigor

The experimental design is reasonable but limited in scope. The study covers four Swiss-German verticals with 8 prompts each, queried across 4 AI search engines over ~45 days (temporal dataset) and up to 10 simultaneous runs (stability dataset). Several methodological choices deserve scrutiny:

Strengths in methodology:

The separation of temporal drift from stochastic variation through simultaneous re-runs is a sound experimental design choice.

The 24-hour timestamp filter for simultaneous runs is a sensible control.

The edge-case policy for empty result sets is clearly articulated and defensible.

The bootstrap convergence analysis (Appendix J) provides actionable minimum run-count recommendations.

Code and data are publicly available, supporting reproducibility.
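The bootstrap convergence idea mentioned among the strengths can be sketched as follows: resample k of the available runs, compare the k-run mean with the full-sample mean, and watch the error shrink as k grows. This is a minimal sketch under assumptions (resampling with replacement, mean absolute error as the criterion, hypothetical run data); it is not the procedure from Appendix J.

```python
import random
import statistics

def convergence_curve(run_values, n_boot=2000, seed=0):
    """Mean absolute error of a k-run mean against the full-sample mean."""
    rng = random.Random(seed)
    reference = statistics.fmean(run_values)  # full-sample mean as reference
    curve = {}
    for k in range(1, len(run_values) + 1):
        errors = []
        for _ in range(n_boot):
            sample = rng.choices(run_values, k=k)  # draw k runs with replacement
            errors.append(abs(statistics.fmean(sample) - reference))
        curve[k] = statistics.fmean(errors)
    return curve

# Hypothetical brand-appearance indicators for 10 simultaneous runs.
runs = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
curve = convergence_curve(runs)
```

Because the reference is itself only a 10-run mean, the curve understates the true error at larger k, which is the finite-sample concern raised in the weaknesses.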

Weaknesses:

The sample is geographically narrow (Swiss IP addresses only) and linguistically constrained (German prompts), limiting generalizability.

Only 8 prompts per vertical is quite small; the paper acknowledges prompt-level heterogeneity is substantial yet draws conclusions from a thin prompt portfolio.

Brand detection via substring matching is crude and acknowledged as a limitation. False positives and missed synonyms/abbreviations could meaningfully affect the stability metrics.

The 10-run maximum for the simultaneous dataset is modest—the convergence analysis treats the 10-run mean as ground truth, which introduces finite-population correction issues the authors acknowledge.

The 70% brand-detection threshold for campaign inclusion is somewhat arbitrary; sensitivity analysis around this threshold is absent.

There's no statistical testing (e.g., confidence intervals on the Jaccard/RBO differences between engines or campaigns), just descriptive statistics.
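The substring-matching weakness listed above can be illustrated with a small comparison between plain substring detection and a word-boundary match over a list of aliases. The brand name "Exa", its alias, and both function names are hypothetical, and the boundary-based matcher is only one possible mitigation, not the authors' method.

```python
import re

def substring_match(brand, text):
    """Naive detection: case-insensitive substring test."""
    return brand.lower() in text.lower()

def boundary_match(aliases, text):
    """Match any alias only when it is not embedded in a longer word."""
    if not aliases:
        return False
    alternatives = "|".join(re.escape(alias) for alias in aliases)
    pattern = r"(?<!\w)(?:" + alternatives + r")(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Hypothetical brand "Exa" and two hypothetical answer snippets.
false_hit_text = "For example, try another provider."
true_hit_text = "We recommend Exa AG."

naive_false_positive = substring_match("Exa", false_hit_text)
strict_on_false_hit = boundary_match(["Exa", "Exa AG"], false_hit_text)
strict_on_true_hit = boundary_match(["Exa", "Exa AG"], true_hit_text)
```

The naive matcher fires on "example", while the boundary-based matcher only fires on the standalone brand mention; missed synonyms still require a maintained alias list.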

Potential Impact

The paper addresses a genuinely practical problem. As marketing budgets increasingly need to account for AI search visibility, the finding that source sets overlap by only 34-42% between consecutive days is striking and commercially relevant. The concrete recommendations (minimum 7 runs, 2-4 week windows, multi-prompt portfolios) are directly actionable for marketing practitioners and GEO tool developers.

However, the academic contribution is somewhat thin. The core finding—that stochastic systems produce variable outputs and need repeated measurement—is not surprising from a statistical or machine learning perspective. The paper essentially documents an expected consequence of temperature-based sampling and retrieval-augmented generation. The novelty lies more in the empirical quantification for a specific application domain than in any methodological or theoretical advance.

The paper could influence the emerging GEO tooling ecosystem by establishing measurement standards. The Gini coefficient analysis of citation concentration (mean 0.715) adds a useful descriptive dimension about the competitive landscape in AI search.
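For readers unfamiliar with the Gini coefficient as a concentration measure, the sketch below computes it from per-source citation counts using the standard sorted-rank formula. The counts are hypothetical and the choice of estimator is an assumption; the paper may compute it differently.

```python
def gini(counts):
    """Gini coefficient of non-negative counts: 0 is perfectly even,
    values near 1 mean a few sources receive most citations."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

even = gini([5, 5, 5, 5])           # every source cited equally
concentrated = gini([0, 0, 0, 10])  # one source takes all citations
```

With n sources, the maximum of this estimator is (n - 1) / n, so values are only comparable across campaigns with similar numbers of sources.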

Timeliness & Relevance

The paper is highly timely. The shift toward AI-powered search is accelerating rapidly, and the marketing industry lacks standardized measurement frameworks. The paper correctly identifies that no native monitoring tools (equivalent to Google Search Console) exist for generative search, creating an observability gap. Published in April 2026 with data from early 2026, it captures a critical moment in the transition.

The relevance extends beyond marketing to information retrieval research more broadly—the instability findings have implications for reproducibility of any study that treats AI search outputs as ground truth.

Strengths & Limitations

Key Strengths:

1. Addresses a timely, practical problem with clear industry relevance

2. Clean experimental design separating temporal and stochastic variation

3. Actionable recommendations with empirical backing (run counts, window lengths)

4. Open code and data

5. The distinction between source-level and brand-level stability is useful and non-obvious (brands are more stable than sources)

Notable Limitations:

1. Limited geographic and linguistic scope (Swiss-German only)

2. Small prompt set (8 per vertical) relative to claims about prompt heterogeneity

3. No formal statistical hypothesis testing

4. The core insight (stochastic systems need repeated measurement) is intuitive rather than novel

5. No comparison with actual SEO stability baselines—the paper claims SEO is "comparatively transparent and stable" but provides no quantitative evidence

6. The paper doesn't explore *why* certain prompts or brands are more stable, missing an opportunity for deeper analysis

7. One author is affiliated with Aurora Intelligence (a GEO monitoring company), which raises a potential conflict of interest that should be disclosed more prominently

8. The paper lacks engagement with the statistical literature on measurement reliability (e.g., test-retest reliability, intraclass correlation coefficients) that would strengthen the theoretical framing

Overall Assessment

This is a competent empirical study that documents an important practical phenomenon for the emerging GEO field. Its primary value is descriptive and prescriptive rather than theoretically innovative. The findings will be useful for practitioners and tool developers, but the academic contribution is incremental—it applies well-known statistical principles (repeated measurement, distributional thinking) to a new domain without significant methodological innovation. The narrow geographic scope and small prompt sets limit the generalizability of the specific numerical recommendations.

Rating: 4.8 / 10

Significance 5.5 · Rigor 5 · Novelty 3.5 · Clarity 7

Generated Apr 13, 2026

Comparison History (80)

vs. CCD-Level and Load-Aware Thread Orchestration for In-Memory Vector ANNS on Multi-Core CPUs

claude-opus-4.6 · 5/15/2026

Paper 2 presents a concrete, well-evaluated systems contribution with significant real-world impact—a CCD-aware thread orchestration framework for vector ANNS deployed in production at Xiaohongshu (RedNote), demonstrating substantial throughput (3.7x) and latency improvements. It addresses a fundamental hardware-software co-optimization problem relevant to the rapidly growing vector search ecosystem. Paper 1 raises a valid methodological point about measurement variability in GEO but is more observational and narrow in scope, essentially advocating for repeated measurements—a less technically deep contribution with limited broader applicability beyond the GEO niche.

vs. NumColBERT: Non-Intrusive Numeracy Injection for Late-Interaction Retrieval Models

claude-opus-4.6 · 5/12/2026

Paper 2 addresses the emerging and broadly relevant problem of measuring visibility in AI-generated search results (GEO), which affects a wide range of stakeholders including businesses, marketers, and researchers studying information access. Its insight that visibility should be treated as a distribution rather than a point estimate has fundamental methodological implications for an entirely new and rapidly growing field. Paper 1, while technically sound, addresses a narrower problem (numeracy in dense retrieval) with incremental improvements to an existing architecture (ColBERT), limiting its breadth of impact.

vs. OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries

gpt-5.2 · 5/8/2026

Paper 2 is likely higher impact: it introduces a new, broadly relevant problem class (oblique/latent queries) and a concrete benchmark over multiple tasks and real corpora, which can catalyze measurable progress across IR, RAG, evaluation, and LLM-assisted search. It also surfaces a compelling retrieval-vs-verification asymmetry with clear methodological framing and actionable research directions. Paper 1 is timely and useful for GEO/measurement practice, but is narrower in scope and more incremental (improving evaluation methodology for a specific setting) compared to a benchmark that can become a community standard.

vs. Beyond Long Tail POIs: Transition-Centered Generalization for Human Mobility Prediction

claude-opus-4.6 · 5/8/2026

Paper 1 presents a novel framework (RECAP) addressing a well-defined, significant problem in human mobility prediction with rigorous methodology—formulating transition-level sparsity as compositional generalization, proposing concrete solutions (multi-hop transitivity, revisit evidence, warm-transition holdout training), and validating on multiple real-world datasets. It offers both theoretical insight and practical advances. Paper 2 makes a valid but relatively narrow observational point about measurement variability in GEO, lacking a substantial methodological contribution or solution framework, limiting its broader scientific impact.

vs. Unified Value Alignment for Generative Recommendation in Industrial Advertising

claude-opus-4.6 · 5/8/2026

Paper 1 addresses a fundamental methodological issue in the emerging field of generative engine optimization (GEO) — the need to treat AI search visibility as a distribution rather than a point estimate. This has broad implications across information retrieval, SEO, digital marketing, and AI evaluation methodology. Its conceptual contribution applies widely as LLM-based search becomes ubiquitous. Paper 2, while technically strong with real-world deployment results, is a more incremental, application-specific contribution to advertising recommendation at a single platform, limiting its broader scientific impact.

vs. RecGPT-Mobile: On-Device Large Language Models for User Intent Understanding in Taobao Feed Recommendation

claude-opus-4.6 · 5/7/2026

RecGPT-Mobile addresses the practically significant challenge of deploying LLMs on mobile devices for real-time recommendation, with demonstrated production-scale results on Taobao. It combines novelty (on-device LLM for recommendations), strong real-world applicability, and methodological rigor (offline and online experiments). Paper 1 raises a valid methodological concern about measurement reliability in GEO but is more incremental—essentially advocating repeated measurements—with narrower scope and less technical depth. Paper 2's contributions span mobile AI, recommendation systems, and edge computing, giving it broader cross-field impact.

vs. RecGPT-Mobile: On-Device Large Language Models for User Intent Understanding in Taobao Feed Recommendation

claude-opus-4.6 · 5/7/2026

Paper 2 addresses a fundamentally new and broadly relevant problem—how to measure visibility in AI-powered search engines—which has implications across information retrieval, marketing, SEO, and AI fairness. It introduces a paradigm shift in how we think about search result measurement (distribution vs. point estimate), applicable to anyone studying or optimizing for LLM-based search. Paper 1, while practically valuable for Taobao's mobile recommendation system, is more narrowly focused on a specific engineering deployment problem with less generalizable scientific contributions.

vs. Intelligent Elastic Feature Fading: Enabling Model Retrain-Free Feature Efficiency Rollouts at Scale

gpt-5.2 · 5/5/2026

Paper 1 presents a novel, system-level infrastructure method (retrain-free, serving-time elastic feature fading) with quantified large-scale production gains, strong methodological support (offline + online experiments, safety/reversibility/monitoring), and immediate real-world applicability for industrial ranking systems. Its impact spans ML systems, recommender/search ranking, and cost/efficiency engineering, and addresses a timely operational bottleneck. Paper 2 raises an important measurement point for GEO but is primarily methodological guidance with narrower technical novelty and less demonstrated system-level or algorithmic advancement.

vs. Bridging Behavior and Semantics for Time-aware Cross-Domain Sequential Recommendation

gemini-3 · 5/5/2026

Paper 1 introduces a foundational paradigm shift for the newly emerging field of Generative Engine Optimization (GEO) in AI search. While Paper 2 offers a rigorous, highly technical improvement in recommendation systems, Paper 1 addresses a timely, fundamental problem with broad implications across information retrieval, digital marketing, and business, giving it a significantly higher potential for widespread scientific and real-world impact.

vs. Budget-Constrained Online Retrieval-Augmented Generation: The Chunk-as-a-Service Model

gpt-5.2 · 5/1/2026

Paper 2 has higher likely scientific impact: it proposes a concrete new service model (Chunk-as-a-Service) plus an online algorithm (UCOSA) addressing a timely, widely felt constraint in RAG deployment—cost/utility under budgets. This has direct real-world applicability for RAG providers/users, clearer methodological contribution (formal selection under constraints, baselines, quantified gains), and broader relevance across IR, systems, and ML ops/economics. Paper 1 is important but more observational/measurement-focused for GEO, with narrower scope and less algorithmic or systems novelty.

vs. The Bandit's Blind Spot: The Critical Role of User State Representation in Recommender Systems

gpt-5.2 · 4/30/2026

Paper 2 targets a rapidly emerging, high-stakes area—LLM-based AI search and generative engine optimization—where measurement methodology is currently underdeveloped yet broadly needed by academia and industry. Its core contribution (visibility as a distribution requiring repeated measurements) is novel in this context, timely, and likely to generalize across models, platforms, and evaluation settings (IR, HCI, marketing analytics, AI auditing). Paper 1 is rigorous and valuable for bandit recommenders, but its scope is narrower and more incremental within a mature field.

vs. K-CARE: Knowledge-driven Symmetrical Contextual Anchoring and Analogical Prototype Reasoning for E-commerce Relevance

gpt-5.2 · 4/29/2026

Paper 1 introduces a more novel, technically concrete framework (SCA + APR) that tackles a well-scoped industrial bottleneck (knowledge boundaries in e-commerce relevance) and reports both offline and online A/B test gains, indicating strong methodological rigor and real-world applicability. Its approach (external knowledge grounding plus analogical calibration) is potentially transferable to other domain-specific retrieval/relevance settings. Paper 2 is timely and broadly relevant as a measurement/metrics framing for GEO, but it is more observational and prescriptive, with less algorithmic innovation and likely smaller downstream technical leverage compared to a deployable modeling framework.

vs. CASP: Support-Aware Offline Policy Selection for Two-Stage Recommender Systems

gpt-5.2 · 4/28/2026

Paper 2 has higher likely scientific impact due to a clearer methodological contribution (a new support-aware offline policy selection objective for two-stage recommenders), formal guarantees (population/finite-class/reconstructed-propensity), and demonstrated empirical evaluation. It addresses a broadly relevant and timely problem in recommender systems and offline evaluation/selection, with potential applicability to any multi-stage decision pipeline. Paper 1 is timely and practically important for GEO measurement, but appears more like an empirical/measurement framing with less algorithmic/theoretical novelty and narrower methodological rigor than CASP.

vs. Disagreement as Signals: Dual-view Calibration for Sequential Recommendation Denoising

gemini-3 · 4/28/2026

Paper 2 addresses the emerging and highly timely field of Generative Engine Optimization (GEO) in AI search, proposing a fundamental methodological shift in how information visibility is measured. Its insights impact both broad academic research in information retrieval and real-world industry practices. In contrast, Paper 1 offers a specialized, albeit rigorous, algorithmic improvement within the narrower domain of sequential recommendation systems.

vs. Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models

gemini-3 · 4/28/2026

Paper 2 addresses a fundamental paradigm shift in information retrieval brought by AI search engines, conceptualizing Generative Engine Optimization (GEO). Its insights into measuring visibility as a distribution rather than a single point have broad implications for how researchers, brands, and developers evaluate AI search systems. While Paper 1 offers a solid algorithmic improvement for LLM re-ranking, Paper 2 tackles a highly timely, cross-disciplinary problem with widespread real-world applications and impact across the broader AI and web ecosystems.

vs. Counterfactual Multi-task Learning for Delayed Conversion Modeling in E-commerce Sales Pre-Promotion

gemini-3 · 4/24/2026

Paper 1 addresses the highly timely and disruptive field of Generative Engine Optimization (GEO) in AI search. Its shift from single-point to distributional measurement of search visibility fundamentally challenges traditional SEO paradigms, offering broad implications across information retrieval, marketing, and web science. Paper 2, while methodologically rigorous and commercially useful, focuses on a narrow, domain-specific problem (e-commerce pre-promotion conversion rates), making its scientific impact more incremental and less broadly applicable than Paper 1.

vs. Counterfactual Multi-task Learning for Delayed Conversion Modeling in E-commerce Sales Pre-Promotion

gemini-3 · 4/24/2026

Paper 1 addresses a fundamental paradigm shift in information retrieval caused by AI search engines. Establishing a new measurement framework for Generative Engine Optimization (GEO) has broad implications across search, marketing, HCI, and AI auditing. In contrast, Paper 2 focuses on a highly specific, commercially driven problem (pre-promotion delayed conversion in e-commerce). While Paper 2 is methodologically rigorous and financially valuable for platforms, Paper 1 offers significantly higher scientific and societal impact by defining how we measure information visibility in the new era of generative AI.

vs. PAPERMIND: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

gpt-5.2 · 4/24/2026

Paper 2 is likely to have higher scientific impact because it introduces a broad, reusable benchmark (PAPERMIND) for integrated multimodal scientific reasoning and critique across seven domains, enabling standardized evaluation and driving progress across many LLM research areas. Its methodological contribution (dataset + task taxonomy + extensive experiments) supports rigorous, repeatable comparisons and has clear downstream applications in model development, evaluation, and scientific assistants. Paper 1 is timely and useful for GEO measurement practice, but its scope is narrower and less likely to catalyze cross-field research.

vs. Multistakeholder Impacts of Profile Portability in a Recommender Ecosystem

gpt-5.2 · 4/24/2026

Paper 2 has higher potential impact due to its novelty in studying ecosystem-level interventions (algorithmic pluralism + profile portability) rather than purely algorithmic tweaks, strong real-world relevance to emerging data portability regulation, and broader cross-field implications spanning recommender systems, governance/policy, and market design. It also targets multiple stakeholders and equity considerations, increasing applicability and timeliness. Paper 1 is timely and practical for AI search/GEO measurement, but is narrower in scope and likely more incremental (measurement guidance) with less breadth across disciplines.

vs. DocQAC: Adaptive Trie-Guided Decoding for Effective In-Document Query Auto-Completion

gemini-3 · 4/21/2026

Paper 2 addresses a fundamental paradigm shift in search (GEO vs SEO) caused by generative AI. By proposing a new evaluation methodology for AI search visibility, it has broad implications for information retrieval, commercial SEO, and AI system evaluation. Paper 1, while technically sound and practical, tackles a much narrower domain (in-document query auto-completion) and offers a more specialized technical improvement rather than a field-wide conceptual shift.