Jianbo Zhu, Xing Fang, Jing Wang, Mingmin Jin, Bokang Wang, Guangxin Song, Zhenyu Xie, Junjie Bai
Generative retrieval offers a promising alternative by unifying the fragmented multi-stage retrieval process into a single end-to-end model. However, its practical adoption in industrial e-commerce search remains challenging, given the massive and dynamic product catalogs, strict latency requirements, and the need to align retrieval with downstream ranking goals. In this work, we propose a retrieval framework tailored for real-world recall scenarios, positioning generative retrieval as a recall-stage supplement rather than an end-to-end replacement. Our method, CQ-SID (Category-and-Query constrained Semantic ID), employs category-aware and query-item contrastive learning along with Residual Quantized VAEs to encode items into hierarchical semantic cluster identifiers, significantly reducing beam search complexity. Additionally, we develop EG-GRPO (Expert-Guided Group Relative Policy Optimization), a reinforcement learning approach that aligns generative recall with downstream ranking under sparse rewards by injecting ground-truth samples to stabilize training. Offline experiments on TmallAPP search logs show that CQ-SID achieves up to 26.76% and 11.11% relative gains in semantic and personalized click hitrate over RQ-VAE baselines, while halving beam search size. EG-GRPO further improves multi-objective performance. Online A/B tests confirm gains in GMV (+1.15%) and UCTCVR (+0.40%). The generative recall channel now contributes substantially in production, accounting for over 50.25% of exposures, 58.96% of clicks, and 72.63% of purchases, demonstrating a viable path for deploying generative retrieval in real-world e-commerce systems.
This paper addresses the practical deployment of generative retrieval in industrial e-commerce search, proposing two main innovations:
CQ-SID (Category-and-Query constrained Semantic ID): Rather than pursuing collision-free one-item-one-ID mappings (as in TIGER and DSI), the authors deliberately design semantic IDs as *cluster identifiers*, where multiple semantically similar items share the same ID. This is built atop RQ-VAE with two enhancements: (a) category-guided first-level quantization that uses the e-commerce category taxonomy to constrain codebook assignments, and (b) query-item contrastive learning via bidirectional InfoNCE loss to align item and query representations in the quantized space. A post-processing step splits oversized clusters.
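To make the cluster-identifier idea concrete, the following is a minimal sketch of residual quantization with a category-restricted first level. It is not the authors' implementation: the masking mechanism (`category_to_codes`), the toy codebooks, and all function names are assumptions introduced for illustration, and codebook training, the contrastive loss, and cluster splitting are omitted.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec: Vector, codebook: List[Vector], allowed: Sequence[int]) -> int:
    """Index of the closest codeword, searching only the allowed indices."""
    return min(allowed, key=lambda i: sq_dist(vec, codebook[i]))


def encode_sid(
    item_vec: Vector,
    category: str,
    codebooks: List[List[Vector]],
    category_to_codes: Dict[str, List[int]],
) -> List[int]:
    """Residual quantization whose first level is restricted by category.

    Assumption: category guidance is modeled as a hard mask over the
    first-level codebook; the paper may use a different mechanism.
    """
    residual = list(item_vec)
    sid: List[int] = []
    for level, codebook in enumerate(codebooks):
        if level == 0:
            allowed: Sequence[int] = category_to_codes[category]
        else:
            allowed = range(len(codebook))
        idx = nearest(residual, codebook, allowed)
        sid.append(idx)
        # Each later level quantizes what the previous levels left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return sid


# Hypothetical toy data: two levels, three codewords each, 2-d embeddings.
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    [[0.1, 0.0], [0.0, 0.1], [-0.1, -0.1]],
]
CATEGORY_TO_CODES = {"shoes": [0, 1], "phones": [2]}

sid_a = encode_sid([0.9, 0.1], "shoes", CODEBOOKS, CATEGORY_TO_CODES)
sid_b = encode_sid([0.95, 0.12], "shoes", CODEBOOKS, CATEGORY_TO_CODES)
sid_c = encode_sid([0.9, 0.1], "phones", CODEBOOKS, CATEGORY_TO_CODES)
```

In this toy setup the two nearby "shoes" items receive the same ID, which is the intended cluster behavior, while an identical embedding tagged "phones" is forced into a different first-level code by the category mask.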
EG-GRPO (Expert-Guided Group Relative Policy Optimization): To align recall with downstream ranking objectives under sparse reward conditions, the authors inject ground-truth SIDs (from click/exposure logs) into the GRPO rollout group. This stabilizes policy gradient estimation and prevents the "mode concentration" collapse observed with vanilla GRPO.
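The stabilizing effect of injecting a ground-truth sample can be illustrated with the group-relative advantage computation commonly used in GRPO. The sketch below is an assumption-laden illustration, not the paper's training code: the binary rewards, the standard-deviation normalization, and the helper names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def inject_expert(rewards: List[float], expert_reward: float) -> List[float]:
    """Append the reward of a ground-truth (expert) sample to the group."""
    return rewards + [expert_reward]


# Sparse-reward case: none of the sampled SIDs hits a clicked item.
sampled = [0.0, 0.0, 0.0, 0.0]

# Without injection every advantage is zero, so the group carries no signal.
vanilla = group_relative_advantages(sampled)

# With a ground-truth SID (assumed reward 1.0) the group has non-zero variance:
# the expert sample gets a positive advantage, the misses a negative one.
guided = group_relative_advantages(inject_expert(sampled, 1.0))
```

Under these assumptions, a rollout group with uniformly zero reward contributes nothing to the policy gradient, whereas the injected sample guarantees a usable learning signal in every group.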
The pragmatic decision to position generative retrieval as a *supplement* to the existing multi-stage funnel, rather than an end-to-end replacement, is a key architectural insight that distinguishes this paper from works like Kuaishou's OneModel.
The paper makes a compelling case for industrial deployment of generative retrieval, and the production deployment at TmallAPP (Alibaba) lends significant credibility.
However, the framework is heavily tailored to e-commerce search with category taxonomies, which limits direct transferability to other domains (e.g., web search, open-domain QA).
Generative retrieval is a rapidly growing area, and the gap between academic prototypes and industrial deployment is a recognized bottleneck. This paper addresses that gap directly. The use of Qwen2.5-0.5B as the backbone reflects the current trend of deploying smaller LLMs for latency-sensitive applications. The application of GRPO (from DeepSeek) to retrieval is timely, though the adaptation is relatively straightforward.
The concurrent works (GSID, CAT-ID2, FORGE, Hi-Gen) indicate this is a crowded space with multiple industrial groups tackling similar problems. The paper's differentiation lies primarily in the cluster-based ID philosophy and the EG-GRPO stabilization technique.
This is a solid industrial systems paper that makes pragmatic contributions to deploying generative retrieval at scale. Its primary value lies in the production validation and the practical design insights rather than fundamental methodological novelty. The cluster-based ID design and the positioning as a recall supplement are sensible and potentially influential for practitioners. However, the individual technical contributions (category-guided quantization, contrastive learning, expert-guided GRPO) are incremental, and the experimental analysis could be more thorough in isolating component contributions and comparing with concurrent industrial systems.
Generated May 15, 2026
Paper 2 has higher likely scientific impact due to strong methodological and empirical rigor (offline + online A/B tests on large-scale real logs), clear real-world utility (latency-aware recall for massive dynamic catalogs), and broad relevance to IR, recommender systems, and RL for retrieval optimization. Its innovations (semantic cluster IDs to reduce beam search; expert-guided RL under sparse rewards) are timely for deploying generative retrieval in production and likely transferable across domains. Paper 1 is valuable for HCI/infographic authoring but is narrower in scope and applicability.
Paper 2 has higher potential impact due to strong real-world applicability and demonstrated production-scale outcomes (online A/B gains in GMV/UCTCVR and large exposure/click/purchase share). It tackles timely, broadly relevant problems in generative retrieval: scalability (semantic cluster IDs reducing beam search) and alignment with ranking goals (expert-guided RL under sparse rewards). The methodological contributions are concrete and validated both offline and online, increasing credibility and transferability across industrial search/retrieval systems. Paper 1 is novel for CDSR with causal/disentanglement framing, but its impact is narrower and less directly evidenced in deployment.
Paper 1 presents a highly rigorous and innovative integration of reinforcement learning with generative retrieval to solve complex alignment and latency issues in e-commerce. Its massive real-world impact is heavily substantiated by impressive online A/B testing results and production deployment statistics, offering a more comprehensive and proven methodological advancement compared to Paper 2's application of LLMs to map search.
Paper 2 demonstrates higher scientific impact due to several factors: (1) it addresses a more fundamental research problem—generative retrieval for e-commerce—with broader applicability across search and recommendation systems; (2) it introduces two novel technical contributions (CQ-SID and EG-GRPO) with stronger methodological innovation combining contrastive learning, RQ-VAEs, and RL; (3) the online A/B test results with substantial production impact (72.63% of purchases) provide compelling real-world validation; (4) the approach of aligning generative retrieval with downstream ranking via RL is more generalizable. Paper 1, while practically valuable, is more application-specific to LinkedIn's hiring workflow with less transferable methodology.
Paper 2 offers a more fundamental architectural contribution by identifying a specific failure mode (Sequential Collapse Propagation) and proposing a novel Transformer variant (TokenFormer) to unify two major paradigms in recommender systems. This theoretical and structural innovation is likely to have broader scientific impact and adoption across various recommendation domains, whereas Paper 1, despite impressive real-world production results, represents an applied engineering framework specific to generative retrieval in e-commerce.
Paper 2 demonstrates higher scientific impact due to its strong real-world validation through online A/B tests in a production e-commerce system (Tmall), showing substantial gains in GMV and conversion rates. The generative retrieval channel contributing over 50% of exposures and 72% of purchases in production is a compelling proof of practical viability. While Paper 1 addresses an important privacy-preserving cross-domain recommendation problem with novel techniques, Paper 2's combination of methodological contributions (CQ-SID, EG-GRPO), rigorous offline/online evaluation, and demonstrated industrial deployment gives it broader and more immediate impact.
While Paper 1 presents a highly successful industrial application of generative retrieval, Paper 2 tackles a fundamental architectural bottleneck in scaling recommender systems. By identifying the structural mismatch between dense connectivity and sparse data, Paper 2 introduces explicit sparsity as a scalable paradigm. This insight has broader scientific implications for deep learning on tabular and high-dimensional sparse data, potentially influencing a wider range of fields aiming to establish scaling laws for recommendation systems. Thus, Paper 2 offers a more foundational architectural innovation with greater potential for cross-domain scientific impact.
Paper 1 has higher potential scientific impact due to a more technically novel and deployable contribution: semantic cluster ID generation to reduce generative retrieval latency plus an RL alignment method for sparse-reward recall optimization, validated with large-scale offline metrics and real online A/B gains (GMV, UCTCVR) at production scale. Its methods can transfer broadly to other large-catalog retrieval systems (ads, recommendation, web search) and are timely given industry interest in generative retrieval. Paper 2 is relevant and socially important but offers more incremental behavioral insight with narrower methodological/technical generalizability.
Paper 1 has higher impact potential due to stronger methodological novelty (semantic cluster IDs via RQ-VAE + category/query constraints, plus expert-guided RL for sparse-reward alignment), and unusually strong real-world validation: large-scale offline gains plus online A/B lifts in GMV and conversion with major production share. Its contributions are timely for deploying generative retrieval under latency/catalog constraints and can influence both IR and industrial recommender/search systems. Paper 2 is timely and broadly applicable but is more of a clever LLM+BM25 orchestration with weaker demonstrated novelty and no production evidence.
Paper 1 likely has higher scientific impact due to stronger demonstrated real-world deployment and measurable business outcomes (online A/B gains in GMV/UCTCVR, large production traffic share). Its contributions (semantic cluster IDs to cut beam complexity; expert-guided RL to handle sparse rewards and align with ranking) are concrete, system-level innovations with clear applicability to large-scale retrieval. Paper 2 is timely and broadly relevant, but impact is less certain without evidence of deployment-scale validation; its RLFT components may be incremental amid a crowded LLM-RL space.