Lingkai Kong, Hezi Jiang, Andrew Ma, Keyu Wang, Akseli Kangaslahti, Milind Tambe
Peer-referral recruitment systems such as respondent-driven sampling are critical for studying and intervening on hidden populations affected by infectious diseases. To accelerate recruitment, public health agencies must adaptively allocate limited referral resources across multiple rounds, where current decisions shape both the number and the covariates of future recruits. Prior work makes this problem tractable by assuming that referrals are drawn i.i.d.\ from a homogeneous population, an assumption that ignores the homophily and shared context that drive real peer recruitment. We instead consider a more realistic model in which both referral capacity and the covariates of newly referred individuals are conditioned on the referrer, learned from data with a censored count model and a conditional generative model. The resulting planning problem is challenging because each candidate allocation induces a different distribution over future recruits. We propose \emph{Generative Frontier Planning} (GFP), a model-based planner that replaces per-step Monte-Carlo sampling with a deterministic backup over a latent covariate-coverage value surrogate. The surrogate is designed so that the expected value of the next frontier depends on the offspring generative model only through finite-dimensional summaries that are amortized offline, and so that the resulting per-round objective is monotone with diminishing returns. Together, these two properties make planning tractable: the deterministic backup eliminates Monte-Carlo sampling, and the diminishing-returns structure lets a marginal greedy allocation achieve a $(1-1/e)$-approximation for the per-round problem. On a simulation environment calibrated to a real respondent-driven sampling dataset, GFP outperforms random, reinforcement-learning, and i.i.d.\ dynamic-programming baselines across four discount factors.
This paper addresses adaptive resource allocation in peer-referral recruitment systems (e.g., respondent-driven sampling for hidden populations), where the key departure from prior work is modeling covariate-dependent arrivals rather than assuming i.i.d. population-level draws. The main contribution is Generative Frontier Planning (GFP), a model-based planner that combines three components: (1) a censored count model for referral capacity, (2) a conditional diffusion model for offspring covariates, and (3) a structured value surrogate based on latent covariate coverage. The critical technical insight is the design of a value function whose expected Bellman backup reduces to a deterministic, closed-form expression via "conditional Laplace embeddings"—summary statistics of the offspring distribution that can be precomputed offline. This eliminates Monte-Carlo sampling at planning time while simultaneously yielding a per-round objective with diminishing returns, enabling a greedy (1−1/e)-approximation guarantee.
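To make the "conditional Laplace embedding" idea concrete, here is a minimal sketch under stated assumptions. It is not the paper's surrogate: the exponential coverage form `1 - exp(-mass)`, the single latent feature, the one-recruit-per-referrer setup, and the names `offspring`, `laplace_embedding`, and both `expected_coverage_*` helpers are all hypothetical. The only point it illustrates is that, when offspring are independent across referrers, the expectation of an exponential coverage term factorizes into per-referrer quantities `E[exp(-phi(X))]` that can be computed once offline, so no sampling is needed at planning time.

```python
import itertools
import math

# Hypothetical discrete offspring distributions: for each referrer, a list of
# (probability, latent_feature_value) pairs for the single recruit they bring in.
offspring = {
    "a": [(0.5, 0.2), (0.5, 1.0)],
    "b": [(0.3, 0.0), (0.7, 0.8)],
    "c": [(1.0, 0.5)],
}


def laplace_embedding(dist):
    """E[exp(-phi(X))] for one referrer; computable offline, once per referrer."""
    return sum(p * math.exp(-phi) for p, phi in dist)


def expected_coverage_closed_form(selected, embeddings):
    """1 - prod_i E[exp(-phi(X_i))]; assumes offspring are independent across referrers."""
    return 1.0 - math.prod(embeddings[i] for i in selected)


def expected_coverage_enumerated(selected):
    """The same expectation by brute-force enumeration of every joint outcome."""
    total = 0.0
    for outcome in itertools.product(*(offspring[i] for i in selected)):
        prob = math.prod(p for p, _ in outcome)
        mass = sum(phi for _, phi in outcome)
        total += prob * (1.0 - math.exp(-mass))
    return total


embeddings = {i: laplace_embedding(d) for i, d in offspring.items()}
selected = ["a", "b", "c"]
closed = expected_coverage_closed_form(selected, embeddings)
brute = expected_coverage_enumerated(selected)
```

In this toy, `closed` and `brute` agree up to floating-point error, while the closed form touches each referrer once instead of enumerating (or sampling) joint outcomes.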
The paper is technically well-constructed. The three propositions form a coherent chain: Proposition 1 establishes the closed-form backup, Proposition 2 proves diminishing returns, and Proposition 3 provides the approximation guarantee. The proofs (in the appendix) are detailed and appear correct on inspection, particularly the careful verification of log-supermodularity of the τ factors in Proposition 2.
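The role of the diminishing-returns property in Propositions 2 and 3 can be illustrated with a small marginal greedy allocator. This is a sketch under assumptions, not GFP itself: `coverage_value` is a toy stand-in objective (a sum of `1 - exp(-mass)` terms over hypothetical latent covariate cells), and `weights`, `budget`, and `cap` are illustrative names that do not come from the paper. No claim is made that the toy matches the paper's surrogate or its constraints.

```python
import math


def coverage_value(allocation, weights):
    """Toy monotone objective with diminishing returns.

    allocation[i] is the number of coupons given to referrer i; weights[i][k]
    is a hypothetical non-negative contribution of one referral from i to
    latent covariate cell k.
    """
    num_cells = len(next(iter(weights.values())))
    total = 0.0
    for k in range(num_cells):
        mass = sum(allocation.get(i, 0) * w[k] for i, w in weights.items())
        total += 1.0 - math.exp(-mass)
    return total


def greedy_allocate(weights, budget, cap):
    """Assign coupons one at a time to the referrer with the largest marginal gain."""
    allocation = {i: 0 for i in weights}
    for _ in range(budget):
        current = coverage_value(allocation, weights)
        best, best_gain = None, 0.0
        for i in weights:
            if allocation[i] >= cap:
                continue  # referrer i already holds the maximum number of coupons
            allocation[i] += 1
            gain = coverage_value(allocation, weights) - current
            allocation[i] -= 1
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # no remaining coupon improves the objective
        allocation[best] += 1
    return allocation


weights = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 0.6],
}
allocation = greedy_allocate(weights, budget=2, cap=3)
```

With these toy weights the first coupon goes to `b` and the second to `c` rather than `a`: once cell 0 is largely covered, further mass there is worth less than opening up cell 1, which is the diminishing-returns behaviour the guarantee relies on.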
However, several methodological concerns warrant attention:
The paper targets an important public health application—recruitment of hidden populations for disease surveillance and intervention. If the approach translates to real settings, it could meaningfully improve the efficiency of respondent-driven sampling campaigns, which are widely used in HIV/STI research globally. The framework is general enough to potentially apply to other networked recruitment problems (contact tracing, snowball sampling, viral marketing).
The technical contribution—amortizing generative model queries through Laplace embeddings to enable deterministic Bellman backups—is a clean idea that could find applications beyond this specific domain, wherever planning must be done over stochastic branching processes with typed entities.
The paper sits at the intersection of several active research threads: generative models for decision-making, adaptive submodularity, and AI for public health. The extension from i.i.d. to covariate-dependent arrivals is a natural and overdue modeling improvement for the RDS literature. The concurrent work by Pan et al. [19] (ICML 2026) on the i.i.d. version provides immediate context, making this a timely generalization.
This is a workshop paper (epiDAMIK @ KDD '26), and for that venue the contribution is substantial. The combination of public health motivation, clean mathematical framework, and empirical demonstration is appropriate. However, for a full venue, the experimental evaluation would need significant strengthening: larger-scale experiments, sensitivity analyses, ablation studies (e.g., the value of the diffusion model vs. simpler conditional models), and ideally some form of real-data validation.
The connection to the concurrent Kangaslahti et al. [14] work on diffusion-driven network samples suggests an active research program, which increases the likelihood of follow-up validation work.
Generated Jun 9, 2026
Paper 2 tackles the widespread High-Dimensional Low-Sample Size (HDLSS) problem, which is prevalent across numerous fields like genomics and medicine. By advancing diffusion models for complex tabular data, it offers broad applicability and high potential for cross-disciplinary adoption. While Paper 1 presents rigorous and impactful work for public health epidemiology, Paper 2's fundamental methodological contribution to tabular data generation gives it a wider breadth of potential scientific impact.
Paper 2 likely has higher scientific impact due to timeliness and broad applicability: on-device safe LLM deployment is a central, fast-moving problem with immediate industry and societal relevance. Its systematic comparison across architectures/objectives and a clear, practical recipe (soft-prompt distillation with TV/KL) can be adopted widely, influencing safety engineering, edge AI, and model compression. Paper 1 is methodologically interesting and novel for adaptive peer-referral planning, but its impact is more domain-specific (public health recruitment/sampling) and relies on simulated evaluation, potentially narrowing near-term uptake.
Paper 1 presents a highly rigorous methodological advancement with strong theoretical guarantees (e.g., submodular optimization bounds) applied to a critical public health challenge. Its interdisciplinary impact on epidemiology and machine learning, particularly in managing hidden populations for infectious diseases, offers deeper scientific innovation compared to Paper 2's more application-focused MLOps approach for edge computing.
Paper 1 addresses a fundamental theoretical question about deep learning dynamics, revealing a hierarchy of Gram operators that govern information transport across layers. This has broad implications for understanding neural network training, kernel methods, and deep learning theory. While Paper 2 presents a novel and rigorous contribution to adaptive recruitment planning, its impact is narrower, targeting a specific public health methodology. Paper 1's insights into the mathematical structure of gradient descent in deep networks have potential to influence a much wider range of research in machine learning theory, optimization, and neural network design.
Paper 2 presents a novel methodological advance in planning under covariate-dependent arrivals, addressing critical limitations in previous i.i.d. models. Its application to peer-referral recruitment offers profound real-world public health impact for tracking infectious diseases in hidden populations. Furthermore, it provides strong theoretical guarantees and a new algorithm (GFP). Paper 1, while valuable, primarily offers an empirical benchmarking of existing GPT-2 models for a specific materials science application (OLEDs), making its broader scientific and methodological impact less expansive than Paper 2.
Paper 1 likely has higher scientific impact due to broader cross-field relevance (graph temporal modeling, long-tail learning, contrastive latent structuring) and a major real-world application area (biodiversity monitoring) with a widely used large-scale dataset (eBird), supporting timeliness and reproducibility. Its methodological contributions integrate spatio-temporal dynamics, community structure, and imbalance in one framework, which can transfer to other ecological and long-tailed spatiotemporal prediction problems. Paper 2 is rigorous and valuable for public health planning, but appears narrower in scope and is validated mainly in calibrated simulation rather than large-scale real deployments.
Paper 2 presents a novel, theoretically grounded algorithm (Generative Frontier Planning) for an important public health problem with clear mathematical contributions (submodularity guarantees, approximation bounds) and empirical validation on realistic simulations. It advances both methodology (combining generative models with combinatorial optimization) and application (hidden population recruitment). Paper 1, while useful, is primarily an engineering/integration contribution combining existing ideas (cheat sheets, decision support, AutoML) into a platform for non-experts, with less methodological novelty and narrower theoretical contribution.
Paper 2 addresses a critical real-world public health challenge (recruiting hidden populations for infectious disease interventions) using highly rigorous methodologies, including conditional generative models and theoretically grounded approximation guarantees. While Paper 1 presents a practical system for LLM tool integration, Paper 2 offers profound societal impact, deeper algorithmic innovation, and tackles a high-stakes, complex domain, resulting in a significantly broader scientific and real-world footprint.
Paper 2 (LH-NeF) addresses a fundamental challenge in representation learning—scalable, modality-agnostic neural field tokenization—with broad applicability across images, 3D shapes, and climate data. Its 42× memory reduction and 133× batch size improvement over meta-learning baselines represent substantial practical gains. The framework's generality across modalities gives it wider potential impact across computer vision, graphics, scientific computing, and generative modeling. Paper 1, while methodologically rigorous with strong theoretical guarantees, targets a narrow application domain (peer-referral recruitment for hidden populations), limiting its breadth of impact despite its real-world importance.
Paper 2 has broader potential scientific and real-world impact due to the universal prevalence of tabular data across nearly all scientific and industrial domains. While Paper 1 provides a rigorous and valuable contribution to public health and survey methodology, Paper 2's Geometry-Aware Tabular Diffusion method achieves state-of-the-art results with significantly fewer parameters (3.5x reduction) and demonstrates portable improvements across different architectures. This makes it highly scalable and immediately applicable to widespread data privacy and augmentation tasks.