FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

Xinhang Yuan, Zexi Huang, Anjia Cao, Xudong Lu, Zikai Wang, Penghao Zhou, Chang Liu, Wentao Guo

May 20, 2026

arXiv:2605.21832v1

cs.AI (primary)

#1078 of 2292 · Artificial Intelligence

Tournament Score: 1419 ± 50 (axis range: 1050 to 1800)

Win Rate: 47%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

Abstract

Modern recommender systems rely heavily on ID-based collaborative filtering: each item is represented by a unique ID embedding that accumulates collaborative signals from user interactions. Livestreaming recommendation, however, faces a unique challenge in this paradigm: a live room typically broadcasts for only tens of minutes, so its item ID remains poorly learned in a persistent cold-start state and ID-centric ranking models fail to generalize. We present FLUID, the first framework to fully retire the candidate-side item ID from a production-scale livestreaming ranker. FLUID couples a cross-domain multimodal encoder, jointly trained on short videos and livestreams to produce discrete hierarchical codes (LUCID), with a late-fusion, ID-free design that injects slice-level and room-level LUCID as independent tokens, stabilized by a staged warmup under online incremental training. Deployed on our industrial livestreaming recommenders with a cross-platform combined user base of over one billion globally, FLUID delivers significant online gains of +0.55% Quality Watch Duration, +2.05% Cold-Start Room Views, and +0.05% Active Hours.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: FLUID

1. Core Contribution

FLUID addresses a well-motivated and genuine problem: the fundamental mismatch between ID-based collaborative filtering and the ephemeral nature of livestreaming items. Live rooms last ~45 minutes on average, meaning item ID embeddings never converge (as shown convincingly in Figure 2). The paper's core novelty is twofold: (a) fully retiring the candidate-side item ID from a production-scale ranker—rather than using multimodal signals as supplements—and (b) a cross-domain multimodal encoder trained jointly on short videos and livestreams to produce hierarchical discrete codes (LUCID) that serve as the sole candidate identifier.

The problem framing itself is a contribution. The paper articulates the "ID-dominance effect"—where ranking models default to the ID signal even when multimodal features are present—and argues convincingly that this effect transitions from tolerable nuisance to fundamental bottleneck when items are ephemeral. This reframing shifts the design question from "how to better fuse multimodal features with IDs" to "how to safely retire IDs entirely."

2. Methodological Rigor

The methodology is solid and well-structured across multiple components:

Cross-domain encoder: Using SigLIP2 + Qwen3-Embedding in a single-tower configuration, trained with a two-stage recipe (alignment then joint fine-tuning), is well-justified. The cross-domain training on short videos and livestreams is a pragmatic solution to the data sparsity problem in live domains, and ablations confirm each design choice contributes.

RQ-KMeans discretization: The choice of RQ-KMeans over RQ-VAE is justified by practical stability under online streaming retraining. The prefix n-gram embedding scheme correctly addresses a real limitation of level-wise decoding in residual quantization—that the same codeword at deeper levels has different meanings depending on the prefix path.
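
To illustrate the prefix-path point, here is a minimal Python sketch of residual quantization against fixed per-level codebooks, followed by prefix n-gram keys. It is not the paper's implementation: the toy `CODEBOOKS`, the helper names `rq_encode` and `prefix_ngram_keys`, and the idea of using code-prefix tuples as embedding-table keys are assumptions made for illustration. The k-means fitting of each level's codebook is omitted.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(vec: Vector, codebook: Codebook) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    best_idx, best_dist = 0, float("inf")
    for idx, centroid in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rq_encode(vec: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Residual quantization: each level encodes what the previous levels missed."""
    residual = list(vec)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def prefix_ngram_keys(codes: Sequence[int]) -> List[Tuple[int, ...]]:
    """One key per depth, made of the whole path down to that depth."""
    return [tuple(codes[:depth]) for depth in range(1, len(codes) + 1)]


# Hypothetical toy codebooks (assumed to come from per-level k-means).
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0]],  # level 1: coarse clusters
    [[0.1, 0.0], [0.0, 0.1]],  # level 2: residual refinements
]

codes_a = rq_encode([1.1, 0.0], CODEBOOKS)
codes_b = rq_encode([0.1, 1.0], CODEBOOKS)
keys_a = prefix_ngram_keys(codes_a)
keys_b = prefix_ngram_keys(codes_b)
```

Both toy vectors receive codeword 0 at level 2, yet their level-2 prefix keys `(0, 0)` and `(1, 0)` differ, so an embedding table keyed by prefix would keep the two meanings apart.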

Staged warmup: This is perhaps the most practically important contribution. The three-stage procedure (slice add-on → ID phase-out → room add-on) is well-motivated by the optimization asymmetry between ID memorization and LUCID generalization. Each stage is validated independently with AUC measurements.
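
The three-stage order can be sketched as a schedule over training steps. This is only an illustration under stated assumptions: the review does not say how the ID is phased out, so the linear decay of `item_id_weight`, the step boundaries, and the names `CandidateInputs` and `warmup_inputs` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateInputs:
    """Which candidate-side signals the ranker sees at a given step."""
    item_id_weight: float   # 1.0 = full item ID, 0.0 = ID retired
    use_slice_lucid: bool
    use_room_lucid: bool


def warmup_inputs(step: int, slice_addon_end: int, id_phaseout_end: int) -> CandidateInputs:
    """Map a training step to the active inputs (assumed linear ID decay)."""
    if not 0 < slice_addon_end < id_phaseout_end:
        raise ValueError("stage boundaries must satisfy 0 < slice_addon_end < id_phaseout_end")
    if step < slice_addon_end:
        # Stage 1, slice add-on: slice-level LUCID joins the intact item ID.
        return CandidateInputs(1.0, True, False)
    if step < id_phaseout_end:
        # Stage 2, ID phase-out: the ID contribution shrinks to zero.
        progress = (step - slice_addon_end) / (id_phaseout_end - slice_addon_end)
        return CandidateInputs(1.0 - progress, True, False)
    # Stage 3, room add-on: ID is gone, room-level LUCID is added.
    return CandidateInputs(0.0, True, True)
```

With boundaries at steps 100 and 200, step 150 falls halfway through the phase-out, so `warmup_inputs(150, 100, 200).item_id_weight` is 0.5.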

Ablation quality: The ablations are comprehensive and well-designed. Table 7 systematically explores the fusion × training recipe space, and the three-arm online A/B test (Table 4) is particularly illuminating—showing that removing the item ID without LUCID yields apparent diversity gains that are actually artifacts of degraded matching. The LARM gate convergence analysis (Figure 5) provides direct evidence for the ID-dominance hypothesis.

However, some methodological gaps exist. The paper does not discuss how LUCID codes handle live rooms with dramatically shifting content (e.g., a streamer cycling through multiple activities). The majority-voting aggregation for room-level LUCID seems simplistic for such cases. Additionally, there is no discussion of computational overhead—the multimodal encoder inference latency for real-time 2-minute slice processing could be significant.
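
The concern about majority voting can be made concrete with a small Python sketch. It assumes the vote is taken over whole slice-level code tuples, which the review does not specify; the function name `room_level_code` and the sample codes are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple

Code = Tuple[int, ...]


def room_level_code(slice_codes: Sequence[Code]) -> Code:
    """Pick the most frequent slice-level code as the room-level code."""
    if not slice_codes:
        raise ValueError("need at least one slice code")
    return Counter(slice_codes).most_common(1)[0][0]


# A room that spends three slices on one activity and two on another.
slices = [(3, 7), (3, 7), (3, 7), (5, 1), (5, 1)]
room_code = room_level_code(slices)
```

Here the room-level code is `(3, 7)`, and the two slices coded `(5, 1)` leave no trace in it, which is the loss of information the review worries about for streamers who switch activities.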

3. Potential Impact

Immediate impact: The production deployment at billion-user scale with statistically significant gains across engagement (+0.55% Quality Watch Duration), cold-start (+2.05% Cold-Start Room Views), diversity (+1.63% Unique Watched Tags), and retention (+0.05% Active Hours) is compelling. These are meaningful improvements at this scale.

Broader implications: The principle that "when items are inherently short-lived, retiring the item ID is more principled than further fusion tricks" could influence other ephemeral-item domains: live commerce, real-time auctions, breaking news recommendation, and event-based content. The staged warmup methodology for transitioning production systems from ID-based to content-based representations is transferable.

Cross-domain transfer: The demonstration that joint training across content domains (short videos → livestreams) improves encoder quality for the sparser domain has implications beyond livestreaming—any recommendation domain with sparse interaction data could benefit from similar cross-domain encoder training.

4. Timeliness & Relevance

This work is highly timely. Livestreaming is a rapidly growing content vertical, and the cold-start problem is increasingly acute as platforms scale. The paper sits at the intersection of two major trends: (a) the integration of foundation models (LLMs, vision-language models) into industrial recommendation, and (b) the growing interest in semantic IDs as alternatives to traditional ID embeddings. FLUID pushes both trends further by showing that semantic codes can fully replace—not just supplement—item IDs in a specific but important domain.

5. Strengths & Limitations

Key Strengths:

Problem-solution alignment: The ephemeral nature of livestream items makes the case for ID retirement more compelling than in other domains where ID-based approaches remain strong.

Production validation: Deployment at billion-user scale with multi-metric improvements provides strong evidence beyond offline AUC gains.

Comprehensive ablations: The systematic exploration of fusion mechanisms, training recipes, embedding designs, and encoder architectures leaves few design questions unanswered.

Honest three-arm analysis: Showing that the "w/o item ID" arm is actually a regression (despite apparent diversity gains) demonstrates intellectual honesty and deepens understanding.

Limitations:

Domain specificity: The claim of "full ID retirement" is limited to the candidate side; user-side IDs presumably remain. The approach may not generalize to domains where items have longer lifetimes and IDs are more informative.

Latency/cost analysis absent: No discussion of the computational cost of running a multimodal encoder (SigLIP2 + Qwen3-Embedding) on every 2-minute slice in real-time production.

Modest offline gains: The final AUC improvement (+0.23%) is relatively small, and the online engagement gains, while statistically significant, are moderate. The cold-start improvements (+2.05%) are more impressive.

Single production system: Results are from one company's platform; reproducibility on other systems is unknown.

LUCID quality degradation modes: No analysis of failure cases—when do LUCID codes produce poor semantic groupings, and how does this affect recommendation quality?

Overall, FLUID makes a well-argued case for a paradigm shift in ephemeral-item recommendation, backed by production evidence. Its impact is strongest within livestreaming and adjacent short-lived content domains, with transferable methodological insights (staged warmup, cross-domain encoding) applicable more broadly.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

Generated May 22, 2026

Comparison History (17)

vs. Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a fundamental and broadly applicable challenge in LLM efficiency—KV cache compression for long-context processing—which impacts the entire LLM community across numerous applications. Its novel meta-token composition approach with attention-flow redistribution offers methodological innovation applicable to any transformer-based model. Paper 1, while demonstrating impressive industrial deployment at scale, addresses a narrower domain (livestreaming recommendation) with solutions highly specific to that vertical. Paper 2's broader relevance to the rapidly growing LLM field gives it higher potential for widespread scientific impact and citation.

vs. Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

gemini-3.1 · 5/22/2026

Paper 2 presents a major paradigm shift in recommender systems by completely retiring ID-based collaborative filtering for transient items, a long-standing challenge in the field. Its proposed multimodal semantic code approach addresses the severe cold-start problem of livestreams. Furthermore, its deployment at an industrial scale with over one billion users demonstrates immense real-world impact and methodological robustness. While Paper 1 offers a useful, training-free technique for Video LLMs, it is relatively incremental compared to the structural overhaul and massive proven application presented in Paper 2.

vs. Advancing Mathematics Research with AI-Driven Formal Proof Search

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to greater novelty and cross-field breadth: it demonstrates AI-driven formal proof search solving previously open mathematical problems, a qualitatively new capability with implications for mathematics, verification, AI alignment/robustness, and scientific discovery workflows. Its methodology (formal verification in Lean, large-scale evaluation on open-problem sets) supports rigor and reproducibility. Paper 2 is highly impactful industrially and timely for recommender systems, but the ideas (multimodal codes, ID-free ranking, warmup training) are more incremental within applied ML and are narrower in academic breadth despite strong scale and deployment results.

vs. TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

claude-opus-4.6 · 5/22/2026

FLUID addresses a fundamental limitation of ID-based recommendation systems for ephemeral content, demonstrating impact at massive scale (1B+ users) with significant production gains. Its novel approach of fully retiring item IDs in favor of multimodal semantic codes represents a paradigm shift in recommender systems with broad applicability. TO-Agents, while innovative in combining LLM agents with topology optimization, addresses a narrower problem with modest success rates (60%) and limited evaluation scope. FLUID's proven industrial deployment and methodological contributions to cold-start and multimodal representation learning give it substantially higher impact potential.

vs. Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

gemini-3.1 · 5/22/2026

Paper 1 addresses fundamental challenges in LLM agentic reasoning by introducing a cognitive architecture (Systems I, II, III). Its approach to self-regulated planning offers broad applicability across AI domains, advancing the pursuit of efficient, autonomous agents. While Paper 2 presents impressive industrial-scale applied improvements in recommender systems, Paper 1's contributions to foundational AI reasoning mechanisms hold greater potential for widespread scientific disruption and cross-disciplinary impact.

vs. ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

claude-opus-4.6 · 5/22/2026

ExComm addresses the fundamental and broadly relevant problem of error propagation in agentic test-time scaling, proposing a novel communication protocol with cross-agent conflict detection and soft belief updates. This has broad applicability across many reasoning tasks and LLM agent architectures, with strong empirical gains on multiple benchmarks. While FLUID solves an important industrial problem (cold-start in livestreaming recommendation) and demonstrates real-world deployment at scale, its impact is more narrowly scoped to a specific recommendation domain. ExComm's contributions to multi-agent reasoning and test-time scaling are more likely to influence a wider range of future research directions.

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

gpt-5.2 · 5/22/2026

Paper 2 has higher likely scientific impact: it identifies a counterintuitive, general failure mode (inverse scaling) in LLM forecasting under superlinear growth and regime-change tail risk, spanning multiple domains (finance, epidemiology, macro) and tying the effect to evaluation methodology. It contributes a new benchmark (ForecastBench-Sim), decompositions pinpointing where errors arise (upper-tail), and evidence across model scale and post-training, making it broadly relevant to ML reliability, evaluation, and deployment. Paper 1 is strong and proven in production but is more application-specific to livestream recommendation.

vs. Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play

gemini-3.1 · 5/22/2026

Paper 1 introduces a novel methodology challenging the standard ID-centric recommendation paradigm for ephemeral content. Its massive real-world deployment on a platform with over 1 billion users and concrete A/B test improvements demonstrate immense practical impact and scalability. In contrast, Paper 2 is an empirical evaluation of specific, proprietary LLMs in a constrained game environment, which is highly dependent on transient model versions and lacks the broad methodological innovation and industrial validation seen in Paper 1.

vs. SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to greater cross-field novelty and broader applicability: it proposes a modular, representation-level coupling between LLMs and molecular/topological/reaction modules, directly targeting a core limitation of language-only scientific reasoning with implications for drug discovery and synthesis. This can generalize to other scientific modalities and supports open research via an open-source 8B system. Paper 2 is methodologically strong and highly impactful industrially, but is more domain-specific (livestreaming recommendation) and its main contribution (ID-free codes/tokens) is less likely to broadly reshape scientific practice beyond recommender systems.

vs. Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings

gemini-3.1 · 5/22/2026

Paper 2 presents a highly impactful, industrially deployed solution to a major challenge in recommender systems (cold-start in livestreaming). By entirely retiring item IDs in favor of multimodal semantic codes, it offers a significant architectural innovation. Its proven real-world application at a scale of over one billion users and measurable online gains demonstrate massive technological and economic impact, outweighing Paper 1's domain-specific contributions to food ingredient embeddings.

vs. AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

claude-opus-4.6 · 5/22/2026

AutoResearchClaw addresses the fundamental challenge of automating scientific discovery with a comprehensive multi-agent framework featuring novel mechanisms (self-healing execution, cross-run evolution, structured debate, human-in-the-loop collaboration modes). Its potential impact spans all scientific fields by augmenting research itself. The finding that targeted human intervention outperforms both full autonomy and exhaustive oversight is a significant insight for AI-assisted research. While FLUID solves an important industrial recommendation problem with real deployment results, its impact is narrower—primarily within livestreaming recommendation. AutoResearchClaw's breadth of potential impact across all of science gives it higher estimated scientific impact.

vs. AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

claude-opus-4.6 · 5/22/2026

FLUID addresses a fundamental limitation of ID-based recommender systems for ephemeral content (livestreaming), proposing a novel ID-free framework with multimodal semantic codes. Its deployment at massive scale (1B+ users) with measurable online gains demonstrates real-world impact. The paradigm shift from ID-based to semantic code-based recommendation has broad implications for recommender systems research. Paper 2, while introducing a useful benchmark for T2I prompting evaluation, addresses a narrower problem with less transformative potential and limited applicability beyond the text-to-image domain.

vs. AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

gpt-5.2 · 5/22/2026

Paper 1 has higher likely scientific impact due to its demonstrated large-scale, real-world deployment and measurable online gains on a billion-user industrial livestreaming recommender, addressing a core, pervasive problem (extreme item cold-start and ephemeral content) with an ID-free ranking architecture and multimodal discrete codes. This combination of methodological innovation plus proven production value suggests broad applicability to other short-lived content domains and recommender systems. Paper 2 is timely and useful as a benchmark, but its impact depends on community adoption and primarily targets evaluation within the T2I prompting niche.

vs. IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

gemini-3.1 · 5/22/2026

Paper 1 addresses a fundamental and ubiquitous bottleneck in LLM agents—idle time during tool execution—with a generic speculative planning approach. Its methodology is broadly applicable across the rapidly expanding field of autonomous agents, promising high cross-domain scientific impact. In contrast, while Paper 2 demonstrates impressive industrial scale and solves a critical problem in livestreaming recommendation, its scientific contributions are more domain-specific, limiting its breadth of impact compared to the foundational LLM inference improvements in Paper 1.

vs. Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

gemini-3.1 · 5/22/2026

While Paper 1 demonstrates massive industrial impact and solves a specific cold-start problem in recommender systems, Paper 2 offers higher scientific impact by proposing a novel paradigm for LLM agents. Shifting the focus from model parameter updates to runtime interface adaptation addresses a critical bottleneck in agentic AI. Its demonstrated transferability across 18 models and multiple environments ensures broad applicability, potentially influencing the fundamental methodology of how researchers build, train, and deploy autonomous LLM agents across diverse fields.

vs. \ECUAS{n}: A family of metrics for principled evaluation of uncertainty-augmented systems

gpt-5.2 · 5/22/2026

Paper 2 (FLUID) likely has higher scientific impact due to its demonstrated real-world, industrial-scale deployment and measurable online improvements on a >1B-user platform, addressing a timely and pervasive problem (cold-start in ephemeral livestream items). It proposes an ID-free ranking paradigm with multimodal discrete semantic codes and an online training strategy, which could influence recommender system design broadly. Paper 1 offers principled evaluation metrics for uncertainty-augmented systems with solid theoretical grounding, but its impact is more specialized to evaluation methodology and may diffuse more slowly than a deployed, system-level recommender innovation.

vs. Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a fundamental and widespread problem in recommender systems (cold-start for ephemeral content) with a novel, fully deployed solution at billion-user scale. Its demonstrated real-world impact on a production system, combined with the generalizable insight of replacing ID-based representations with multimodal semantic codes, gives it broader applicability across recommendation domains. Paper 1, while intellectually elegant in formalizing trust calibration as preferential Bayesian optimization, is more niche in scope and remains primarily a theoretical formalization without demonstrated large-scale empirical validation.