Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

Rahul Roy, Nur Sunar, Jayashankar M. Swaminathan

Jun 9, 2026arXiv:2606.11118v1

cs.LGmath.OCmath.PRstat.APstat.ML

#3679of 5669·cs.LG

#3679 of 5669 · cs.LG

Tournament Score

1365±44

10501750

44%

Win Rate

Wins

Losses

Matches

Rating

7.2/ 10

Significance7.5

Rigor7.5

Novelty7.5

Clarity7

Abstract

We study a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers in a discrete-time setting. In each period, a customer arrives seeking service, and the platform chooses an assortment of sellers to display. The customer then proposes a transaction to at most one seller in the assortment according to a multinomial logit choice model. After a fixed number of periods, sellers review the proposals they have received and each chooses at most one customer according to another multinomial logit choice model, after which the cycle repeats. A key challenge is that the platform does not know the choice-model parameters of either customers or sellers in advance. To our knowledge, this is the first study of a dynamic assortment problem in which both sides' choice parameters are unknown. We develop a data-driven algorithm that learns these parameters while optimizing the platform's objective over time. We evaluate performance using regret, which measures revenue loss relative to a clairvoyant benchmark that knows all parameters and customer arrivals in advance. We show that the algorithm's worst-case regret grows polylogarithmically over time, and we derive a matching lower bound, establishing its rate optimality.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

Core Contribution

This paper introduces the "Two-Way Learning" (TWL) problem — a dynamic assortment optimization problem on two-sided platforms where the platform must simultaneously learn the MNL choice parameters of both customers (demand side) and sellers (supply side). The key novelty is that prior work on dynamic assortment under incomplete information exclusively addressed one-way learning (learning only customer preferences), assuming seller preferences are either known or irrelevant. The authors develop a UCB-based algorithm (TWL-UCB) achieving O(log²(NT)) worst-case regret and prove a matching Ω(log²(NT)) lower bound, establishing rate optimality.

The practical motivation is compelling: on platforms like Airbnb, EnergySage, and Freelancer, transactions require mutual agreement. Fradkin (2017) documents that hosts rejected 42% of guest requests on Airbnb, demonstrating that ignoring seller-side preferences leads to substantial revenue loss.

Methodological Rigor

Algorithm Design. The TWL-UCB algorithm has a clean three-step structure: (1) offer assortments maximizing estimated marginal increase in expected matches using UCB estimates, (2) update customer UCB estimates after observing choices each period, and (3) update seller UCB estimates at epoch boundaries when matching decisions are observed. The use of sharper confidence radii (incorporating the empirical mean in the square-root term, following Kleinberg et al. 2008) rather than standard Hoeffding-type bounds is technically important for achieving the polylogarithmic rate.

Upper Bound Analysis. The proof of Theorem 1 proceeds through a careful decomposition into high-probability and low-probability regimes. The key insight is the "marginal increase in expected matches" formulation (Equation 6), which decomposes the epoch-level reward into period-level contributions, bridging the asynchronous feedback structure. Lemma 2 bounds the gap between UCB-estimated and true marginal increases, and the aggregation over all periods yields the O(log²(NT)) bound. The technical execution appears sound, with appropriate handling of the delayed feedback and the nonlinear dependence of rewards on unknown parameters.

Lower Bound Analysis. Theorem 2 constructs an adversarial parameterization (Equation 32) creating a "needle in a haystack" scenario where the optimal assortment is nearly indistinguishable from suboptimal ones. The proof leverages KL divergence bounds (Lemma 5) and Pinsker's inequality to connect the regret to information-theoretic limits. The matching Ω(log²(NT)) lower bound confirms rate optimality. The construction is elegant — the adversary sets K=1 and makes reward parameters constant over time, simplifying the state dependence while preserving the fundamental two-sided learning difficulty.

Potential Concern. The assumption that v_{z,i} ≤ 1 and u_{i,z} ≤ 1 (Assumption 1) is standard but means the no-choice option dominates. While reasonable for many platforms, this restricts applicability to settings with high engagement rates. The paper acknowledges this is common in practice (citing cart abandonment rates).

Potential Impact

Theoretical. The polylogarithmic regret bound represents a significant improvement over the Õ(√T) regret achieved in one-way learning settings (Agrawal et al. 2019, Aznag et al. 2021). This is somewhat surprising — despite the harder problem (learning two sides), the structure of the two-sided interaction enables faster learning. This insight could influence how researchers think about exploration-exploitation tradeoffs in multi-agent settings.

Practical. The algorithm provides actionable guidance for platform operators. The simulation finding that assortment size beyond a threshold (~30 out of 100 sellers) yields diminishing returns is practically valuable. The robustness results (Theorems 3-4) showing performance is unaffected by partial information disclosure or customer-dependent rewards enhance practical applicability.

Broader Influence. The framework could extend to other two-sided matching contexts: job platforms, dating apps, ride-sharing. The paper opens a new problem class — combinatorial bandits with two-sided learning and delayed feedback — that could attract substantial follow-up work.

Timeliness & Relevance

The paper addresses a genuine gap at the intersection of platform economics and online learning. Two-sided platforms are increasingly dominant, and the mismatch between one-sided learning models and two-sided reality is well-documented. The timing is excellent given growing interest in algorithmic marketplace design.

Strengths

1. Novel problem formulation that captures a genuine practical need — the TWL framework is a meaningful departure from existing one-way learning models.

2. Rate-optimal algorithm with matching upper and lower bounds, providing a complete characterization of the problem's fundamental difficulty.

3. Polylogarithmic regret despite the harder two-sided setting, revealing interesting structural properties.

4. Comprehensive analysis including extensions (partial information, customer-dependent rewards) and simulations comparing against the adapted Agrawal et al. (2019) algorithm.

5. Clean exposition of a complex problem with clear separation of algorithmic design, upper bound, and lower bound analyses.

Limitations

1. MNL assumption on both sides may be restrictive; real seller decisions may involve more complex considerations (capacity, scheduling, fatigue).

2. Fixed epoch length K is assumed known and constant; in practice, response windows vary.

3. Simulation scale is modest (N=10 sellers, limited customer types). Larger-scale experiments would strengthen empirical claims.

4. Independence assumptions — customer arrivals are treated as exogenous, and seller decisions across epochs are independent. Strategic behavior by either side is not modeled.

5. The lower bound construction uses K=1, which is the simplest epoch structure. Whether the log²(NT) rate persists for general K as the tight bound remains less clear, though the upper bound holds for all K.

6. Comparison baseline — only one competing algorithm (Agrawal et al. 2019) is benchmarked against, and only under K=1 (most favorable for the competitor).

Rating:7.2/ 10

Significance 7.5Rigor 7.5Novelty 7.5Clarity 7

Generated Jun 10, 2026

Comparison History (25)

Lostvs. MemNovo: Look Back at the Spectrum for Balanced De Novo Peptide Sequencing from Mass Spectrometry

Paper 2 addresses a critical bottleneck in proteomics and computational biology, offering a training-free mechanism that significantly improves de novo peptide sequencing. Its direct implications for disease research and drug discovery provide higher potential real-world impact and breadth across fields compared to Paper 1, which, while theoretically rigorous, is more narrowly focused on operations research and e-commerce optimization.

gemini-3.1-pro-preview·Jun 11, 2026

Lostvs. Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Paper 1 addresses a critical bottleneck in Large Language Model (LLM) training by accelerating the RL rollout stage by up to 1.8x. Given the current explosion of generative AI research and deployment, significant improvements to LLM training efficiency offer immense practical value, massive cost savings, and broader, more immediate real-world impact compared to the specialized operations research focus of Paper 2.

gemini-3.1-pro-preview·Jun 11, 2026

Wonvs. Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Paper 2 addresses a fundamental theoretical problem in operations research/management science—dynamic assortment optimization on two-sided platforms with unknown parameters on both sides. It introduces a novel formulation (first to study unknown parameters on both sides), develops a provably rate-optimal algorithm with matching upper and lower regret bounds, and has broad applicability to real-world platforms (e.g., gig economy, marketplaces). Paper 1, while useful, is primarily an engineering benchmark/adapter for evaluating coding agents—incremental infrastructure work with narrower scope and less methodological novelty.

claude-opus-4-6·Jun 11, 2026

Lostvs. Latent World Recovery for Multimodal Learning with Missing Modalities

Paper 2 likely has higher scientific impact due to broader and more timely relevance: missing-modality multimodal learning is a widespread problem across healthcare, bioinformatics, robotics, and sensing. The proposed framework targets real-world incomplete data settings and demonstrates results on consequential biomedical tasks (e.g., cancer prediction), increasing translational potential. Paper 1 is methodologically rigorous with strong regret guarantees and novelty in learning both-sided choice parameters, but its applicability is narrower (two-sided platform assortment under specific MNL assumptions) and may influence a smaller set of domains compared with missing-modality representation learning.

gpt-5.2·Jun 11, 2026

Lostvs. COGENT: Continuous Graph Emulators with Neural Ordinary Differential Equations for Long-Term Physical Forecasting

Paper 2 has higher potential impact: it introduces a broadly applicable ML framework (continuous-time graph Neural ODE emulator) for long-term forecasting on irregular meshes, a need spanning climate/earth systems, CFD, and geoscience. The ability to query arbitrary future times and improved long-horizon stability are timely and practically valuable for surrogate modeling. Paper 1 is methodologically rigorous with strong regret guarantees and novelty in learning both sides’ choice models, but its impact is more specialized to two-sided platform assortment/revenue management. Paper 2’s cross-domain applicability and timeliness likely yield wider scientific uptake.

gpt-5.2·Jun 10, 2026

Lostvs. Efficiently Learning Drifting Halfspaces with Massart Noise

Paper 2 addresses a fundamental problem in theoretical machine learning: learning under both concept drift and Massart noise. By establishing a computationally efficient algorithm and providing formal evidence of an information-computation tradeoff, it offers deep foundational insights that can broadly influence robust machine learning. While Paper 1 provides a highly practical solution for two-sided platforms, Paper 2's theoretical contributions to learning halfspaces address core algorithmic challenges with wider generalizability across AI and data science.

gemini-3.1-pro-preview·Jun 10, 2026

Wonvs. Data-Driven Runway and Taxiway Exits Prediction of Landing Aircraft: A Case Study at Hartsfield-Jackson Atlanta International Airport

Paper 1 addresses a fundamental problem in online platform optimization with novel theoretical contributions—first to study two-sided unknown parameters in dynamic assortment, achieving rate-optimal regret bounds. It has broad applicability across platform economics, operations research, and machine learning. Paper 2 is a well-executed but incremental applied ML study limited to a single airport, benchmarking existing classifiers with moderate performance. Paper 1's theoretical novelty, generalizability, and methodological rigor give it substantially higher potential for scientific impact across multiple fields.

claude-opus-4-6·Jun 10, 2026

Lostvs. Optimal Post-Training Quantization Scales and Where to Find Them

Paper 2 addresses the highly timely and practically impactful problem of LLM compression through quantization, which is critical for deploying large language models efficiently. PiSO provides an exact, efficient solution for optimal quantization scales—a fundamental component used across the rapidly growing LLM deployment ecosystem. Its broad applicability across model families and bit-widths, combined with the massive interest in efficient LLM inference, gives it wider potential impact. Paper 1, while theoretically rigorous with optimal regret bounds for two-sided assortment learning, addresses a more niche operations research problem with narrower immediate applicability.

claude-opus-4-6·Jun 10, 2026

Wonvs. A Systematic Approach for Selecting Trajectories for Data Augmentation

Paper 2 tackles a novel theoretical problem with unknown parameters on both sides of a two-sided platform, providing an algorithm with rate-optimal theoretical guarantees (regret bounds). This rigorous methodological contribution and broad applicability to prominent online marketplaces suggest a higher potential scientific impact compared to Paper 1, which focuses on empirical evaluation of heuristic data augmentation strategies for trajectory data.

gemini-3.1-pro-preview·Jun 10, 2026

Wonvs. Learning Doubly Sparse Explicitly Conditioned Transforms

Paper 1 addresses a highly relevant problem for the booming two-sided platform economy (e.g., Airbnb, Uber) with exceptional theoretical rigor. By providing the first study of dynamic assortment with unknown choice parameters on both sides, and establishing rate-optimal polylogarithmic regret with a matching lower bound, it offers immense real-world applicability and fundamental methodological contributions. Paper 2 presents a solid signal processing advancement, but Paper 1 has broader economic implications and stronger theoretical guarantees.

gemini-3.1-pro-preview·Jun 10, 2026

#3679of 5669·cs.LG

#3679 of 5669 · cs.LG

Tournament Score

1365±44

10501750

44%

Win Rate

Wins

Losses

Matches

Rating

7.2/ 10

Significance7.5

Rigor7.5

Novelty7.5

Clarity7