Weixian Xu, Shilong Liu, Mengdi Wang
In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability: real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%.
EEVEE introduces the first multi-dataset test-time prompt learning framework for LLM agents. The key insight is that existing prompt learning methods (GEPA, ACE, Combee) are designed for single-dataset/benchmark settings and suffer from cross-dataset interference when exposed to heterogeneous task streams. EEVEE addresses this through two main innovations: (1) a router that partitions incoming inputs into task clusters and assigns them to specialized prompt configurations, and (2) a router-prompt co-evolution strategy with a three-stage training process (initialization, exploration, convergence) that jointly optimizes routing decisions and prompt content through interleaved phases.
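To make the mutual dependency between router and prompts concrete, the toy sketch below alternates a prompt phase (router held fixed) with a router phase (prompts held fixed). It is not EEVEE's algorithm: the 1-D features, the nearest-centroid router, the candidate prompts, and the score table are all hypothetical stand-ins, and the paper's three-stage schedule, router scoring terms, and Pareto-front maintenance are not modeled. It only illustrates why interleaving the two phases can help when each depends on the other.

```python
from statistics import mean

# Hypothetical toy data: (feature, task) pairs. The 1-D feature stands in for
# an input embedding; none of these values come from the paper.
EXAMPLES = [(0.1, "math"), (0.2, "math"), (0.3, "math"),
            (0.8, "code"), (0.9, "code"), (1.0, "code")]

# Hypothetical candidate prompts and a made-up (prompt, task) -> score table.
PROMPTS = ["generic", "show arithmetic steps", "write and test code"]
SCORE = {
    ("generic", "math"): 0.5, ("generic", "code"): 0.5,
    ("show arithmetic steps", "math"): 0.9, ("show arithmetic steps", "code"): 0.3,
    ("write and test code", "math"): 0.2, ("write and test code", "code"): 0.8,
}

def route(feature, centroids):
    """Router: send an input to the cluster with the nearest centroid."""
    return min(range(len(centroids)), key=lambda c: abs(feature - centroids[c]))

def prompt_phase(centroids, prompts):
    """Hold the router fixed; pick the best candidate prompt for each cluster."""
    new_prompts = list(prompts)
    for c in range(len(centroids)):
        members = [task for feat, task in EXAMPLES if route(feat, centroids) == c]
        if members:  # keep the old prompt if nothing was routed to this cluster
            new_prompts[c] = max(
                PROMPTS, key=lambda p: mean(SCORE[p, t] for t in members))
    return new_prompts

def router_phase(centroids, prompts):
    """Hold prompts fixed; move each centroid to the inputs its prompt serves best."""
    new_centroids = list(centroids)
    for c in range(len(centroids)):
        feats = [feat for feat, task in EXAMPLES
                 if max(range(len(prompts)),
                        key=lambda k: SCORE[prompts[k], task]) == c]
        if feats:
            new_centroids[c] = mean(feats)
    return new_centroids

def co_evolve(rounds=3):
    """Interleave prompt and router phases for a fixed number of rounds."""
    centroids = [0.0, 0.5]            # deliberately poor initial router
    prompts = ["generic", "generic"]  # one shared prompt to start
    for _ in range(rounds):
        prompts = prompt_phase(centroids, prompts)
        centroids = router_phase(centroids, prompts)
    return centroids, prompts

def average_score(centroids, prompts):
    """Mean toy score when every example is served by its routed prompt."""
    return mean(SCORE[prompts[route(f, centroids)], t] for f, t in EXAMPLES)
```

In this toy setup the initial router sends one math example to the second cluster. After the first prompt phase specializes the two prompts, the router phase moves the centroids so that the misrouted example lands with the other math inputs, and the prompts then stay stable. The router phase uses task labels to decide which prompt serves an example best, which mirrors the review's point that the approach relies on labeled adaptation data.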
The problem formulation itself—multi-dataset test-time prompt learning—is a meaningful contribution. Real-world deployed agents do face mixed task streams, and the paper convincingly demonstrates (Figure 1) that existing methods accumulate negative retention as more tasks are introduced, while EEVEE maintains positive cumulative gains.
Practical relevance: The multi-dataset prompt learning setting is genuinely important. Production LLM agents serve diverse queries, and maintaining separate prompt optimization pipelines per task type is costly. EEVEE's router-based approach offers a principled alternative.
Token efficiency: EEVEE uses 4.32k tokens per example vs. ACE's 21.30k (~4.9× reduction), making it practically viable. The overhead compared to GEPA (3.47k) is modest.
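The efficiency comparison can be checked directly from the three per-example token figures quoted above. The short script below is only a worked calculation; the dictionary and helper names are illustrative, not taken from the paper.

```python
# Tokens per example, in thousands, as quoted in this review.
TOKENS_K = {"EEVEE": 4.32, "ACE": 21.30, "GEPA": 3.47}

def ratio(a, b):
    """Token cost of method a relative to method b."""
    return TOKENS_K[a] / TOKENS_K[b]

ace_reduction = ratio("ACE", "EEVEE")       # how many times more tokens ACE uses
gepa_overhead = ratio("EEVEE", "GEPA") - 1  # fractional extra cost over GEPA

print(f"ACE / EEVEE: {ace_reduction:.2f}x")       # prints "ACE / EEVEE: 4.93x"
print(f"EEVEE over GEPA: {gepa_overhead:.1%}")    # prints "EEVEE over GEPA: 24.5%"
```

So the "modest" overhead relative to GEPA amounts to roughly a quarter more tokens per example, while the saving relative to ACE is close to fivefold.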
Cross-model transfer: Prompts learned on Qwen3-4B-Instruct transfer to DeepSeek-V3.2 (39.75→54.10 average, a 14.35-point difference), suggesting the learned prompts capture task-general strategies rather than model-specific artifacts.
Limitations on impact: The reliance on labeled adaptation data, the moderate benchmark coverage, and the lack of truly online/streaming adaptation (requiring prepared adaptation sets) constrain near-term practical deployment. The case study revealing that prompt learning struggles with knowledge-intensive QA (GPQA Diamond consistently regresses) identifies a fundamental limitation of the approach.
The paper addresses a genuine gap in the prompt learning literature. As LLM agents are increasingly deployed in multi-task settings, the single-benchmark assumption becomes untenable. The timing is appropriate given the recent surge in prompt optimization methods (GEPA at ICLR 2026, ACE at ICLR 2026, Combee 2026). The paper positions itself well against these contemporaneous works.
However, the field is moving rapidly toward more powerful base models and reasoning-capable systems. Whether test-time prompt learning remains the dominant adaptation mechanism (versus fine-tuning, tool use, or in-context learning with retrieval) is uncertain, potentially limiting long-term relevance.
The paper's case study insight—that prompt learning excels at learning reusable procedures but struggles with domain knowledge—is perhaps its most generalizable contribution. This finding has implications beyond the specific framework and could guide future work on when to apply prompt learning versus other adaptation strategies.
The writing is generally clear, though the framework involves many interacting components (router scoring with multiple terms, annealing schedules, Pareto-front maintenance, three training stages) that may hinder adoption and reproducibility despite code release.
Generated Jun 10, 2026
Paper 2 addresses the critical challenge of continuous, real-world deployment of LLM agents across diverse datasets. This is a highly timely problem with broad applicability across AI. While Paper 1 offers a solid technical improvement for robotic manipulation, the broader applicability of test-time learning for foundation models and the substantial performance gains over state-of-the-art LLMs suggest Paper 2 will have a larger cross-disciplinary impact.
Paper 2 proposes a fundamental architectural improvement to Mixture-of-Experts (MoE) models, backed by theoretical guarantees and validated across large-scale pretraining. Since MoE is the backbone of state-of-the-art LLMs, enhancing router efficiency and alignment has massive, broad implications for foundational model training. Paper 1 offers a valuable but more applied contribution to test-time prompt learning for agents, which tends to have a narrower, less foundational impact compared to core architectural advancements.
Paper 1 addresses a clinically important problem (survival analysis) with a novel adaptation of tabular foundation models, combining pretrained representations with survival-aware objectives. It demonstrates rigorous evaluation on multiple benchmarks including large-scale clinical datasets (MIMIC-IV, eICU), showing meaningful improvements. The work bridges foundation models and survival analysis—a significant methodological contribution with direct clinical applications. Paper 2, while addressing a practical problem in test-time prompt learning, is more incremental in the LLM agent space and has narrower impact scope. Paper 1's clinical relevance and cross-disciplinary contribution give it higher potential impact.
Paper 2 addresses a highly timely and impactful problem in the rapidly growing field of LLM agents: test-time prompt learning across heterogeneous real-world data streams. Its potential for broad real-world application and its significant empirical improvements over state-of-the-art models suggest wider adoption and broader impact than Paper 1, which, while theoretically rigorous, focuses on a more specialized algorithmic improvement within the narrower domain of bandit theory.
Paper 2 (EEVEE) likely has higher impact: it proposes a scalable, multi-dataset test-time prompt learning framework for agents—a timely problem aligned with real-world deployment where task streams are heterogeneous. The router + prompt co-evolution mechanism is a concrete methodological contribution with broad applicability across domains (agent systems, continual/online adaptation, prompt optimization). Reported large gains over strong baselines across multiple datasets suggest practical relevance. Paper 1 offers an important diagnostic of ICL limits for structured data, but is more narrowly scoped to a specific failure mode and primarily characterizes limitations rather than providing a widely deployable solution.
Paper 2 presents a more fundamentally novel and interdisciplinary contribution by demonstrating that brain fMRI signals can directly enhance LLM reasoning, moving beyond correlation to causal guidance. This bridges neuroscience and AI in a groundbreaking way, with broad implications for both fields. The finding that brain signals provide orthogonal gains to language-only supervision across 10 LLMs of varying scales is particularly impactful. Paper 1, while practically useful, represents an incremental advance in prompt engineering. Paper 2's novelty, cross-disciplinary breadth, and potential to open entirely new research directions give it higher scientific impact.
Paper 1 likely has higher near-term scientific impact: it introduces a practical, multi-dataset test-time prompt learning framework for LLM agents addressing a widely felt deployment gap (heterogeneous real-world task streams), with large empirical gains on modern foundation models and clear applicability to agentic systems. Its breadth spans ML systems, NLP, and autonomous agents, and it is highly timely given current LLM deployment trends. Paper 2 is methodologically rigorous with strong theory and lower bounds, but its impact is narrower (online learning theory for drifting halfspaces) and may diffuse more slowly into practice.
OncoTraj addresses a critical unmet need in precision oncology by providing the first public benchmark for longitudinal resistance prediction in EGFR-mutant NSCLC. It creates reusable infrastructure (harmonized dataset, evaluation harness, leakage-audited splits) that can catalyze an entire research community around a clinically important problem. Its honest reporting of negative results (no model beats chance with current features) provides actionable insight directing future data collection (serial ctDNA). While Paper 1 offers incremental improvements in prompt learning for LLM agents, Paper 2 has broader cross-disciplinary impact spanning oncology, genomics, and ML, with direct translational potential.
Paper 2 (FTM) has higher likely scientific impact due to stronger methodological novelty and broader cross-field relevance: it offers a principled surrogate modeling framework for stochastic/chaotic dynamics that avoids drift/diffusion/score estimation, includes stability analysis, and targets widely important problems (turbulence, stochastic PDEs) with clear real-world applications in physics, climate, engineering, and UQ. Paper 1 is timely and practically useful for LLM agents, but test-time prompt routing/co-evolution is more incremental within a fast-moving, benchmark-driven area and may be superseded by model-architecture advances.
Paper 2 addresses a fundamental theoretical question in reinforcement learning for LLM post-training—advantage estimation for inference-time objectives like max@K. It provides rigorous theoretical contributions (unbiased baselines, centered advantages, unified framework) with broad applicability across reasoning model training. Paper 1, while practically useful, is more incremental—combining routing with prompt learning for multi-dataset settings. Paper 2's contributions to RL foundations for LLM training are more likely to influence a wider range of future work given the rapid growth of RL-based LLM post-training.