Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection
Sihang Zeng, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen
Abstract
Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms 9 strong baselines on the overall population and a challenging never-smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent's prediction loss converges quickly while the worker agents' temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.
AI Impact Assessments
Scientific Impact Assessment: Traj-Evolve
1. Core Contribution
Traj-Evolve introduces a self-evolving multi-agent framework for longitudinal EHR modeling in lung cancer early detection. The key insight is that existing LLM-based multi-agent systems process patients in isolation—unlike clinicians who accumulate diagnostic experience from prior cases. The paper addresses this through two complementary mechanisms: (1) an Experience Pool (ExPool), a non-parametric vector database that indexes rejection-sampled reasoning traces for retrieval-augmented "patients-like-me" few-shot learning; and (2) Multi-Agent Reinforcement Learning (MARL) via reward-ranked fine-tuning that parametrically updates both worker and manager agents. A leave-one-out cross-retrieval strategy bridges the two mechanisms, ensuring training and inference distributions remain aligned.
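The leave-one-out idea can be illustrated with a small sketch. This is not the authors' implementation: the pool layout, the cosine similarity, and the names `retrieve` and `exclude_id` are assumptions chosen for illustration. The point is only that a training patient never retrieves its own trace, so the retrieved context at training time resembles what an unseen patient would get at inference time.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(pool, query_vec, k, exclude_id=None):
    # Return ids of the k most similar pool entries.
    # exclude_id (hypothetical parameter) skips the query patient's own entry.
    scored = [
        (cosine(query_vec, entry["vec"]), pid)
        for pid, entry in pool.items()
        if pid != exclude_id
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [pid for _, pid in scored[:k]]

# Toy pool: the embeddings and traces are placeholders, not real data.
pool = {
    "p1": {"vec": [1.0, 0.0], "trace": "trace for p1"},
    "p2": {"vec": [0.9, 0.1], "trace": "trace for p2"},
    "p3": {"vec": [0.0, 1.0], "trace": "trace for p3"},
}

# Training time: p1 is already in the pool, so it is held out of its own retrieval.
train_neighbors = retrieve(pool, pool["p1"]["vec"], k=2, exclude_id="p1")

# Inference time: a new patient is not in the pool, so nothing needs excluding.
test_neighbors = retrieve(pool, [0.8, 0.2], k=2)
```

Without the exclusion, every training patient would retrieve itself as the nearest neighbor, which is a shortcut that is unavailable for unseen patients.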
The conceptual framing—transforming static patient-by-patient prediction into a continuously improving clinical learning system—is compelling and clinically motivated. The analogy to clinical expertise accumulation is sound, and the technical realization is well-designed.
2. Methodological Rigor
Strengths in design: The rejection sampling strategy for ExPool construction (Eq. 3-4) is thoughtful—retaining the "best" trace per patient regardless of correctness captures the system's decision boundary. The hard rejection filter for MARL (Eq. 9-10) ensures only clinically consistent reasoning enters the training pipeline. The decoupled optimization of worker and manager agents (Eqs. 14-15) is justified by their distinct roles and validated by the asymmetric learning curves observed in Figure 5.
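The difference between the two selection rules described above can be sketched as follows. The reward function, the 0.5 threshold, and the function names are assumptions for illustration only; the paper's Eqs. 3-4 and 9-10 are not reproduced here and may differ.

```python
def reward(pred_risk, label):
    # Assumed reward: closeness of the predicted risk to the binary label.
    return 1.0 - abs(label - pred_risk)

def best_trace(rollouts, label):
    # ExPool-style selection: keep the highest-reward rollout per patient,
    # even when that rollout is still on the wrong side of the threshold.
    return max(rollouts, key=lambda r: reward(r["risk"], label))

def hard_filter(rollouts, label, threshold=0.5):
    # MARL-style selection: keep the best rollout only if its thresholded
    # prediction agrees with the label; otherwise drop the patient.
    best = best_trace(rollouts, label)
    correct = (best["risk"] >= threshold) == bool(label)
    return best if correct else None

# Two hypothetical cancer cases (label 1) with m=4 rollouts each.
patient_a = [{"risk": r} for r in (0.2, 0.7, 0.4, 0.9)]
patient_b = [{"risk": r} for r in (0.1, 0.3, 0.2, 0.4)]

pool_a = best_trace(patient_a, label=1)    # risk 0.9, correct
pool_b = best_trace(patient_b, label=1)    # risk 0.4, best available but wrong
train_a = hard_filter(patient_a, label=1)  # kept for fine-tuning
train_b = hard_filter(patient_b, label=1)  # rejected
```

Under these assumptions, patient B still contributes a trace to the memory, which records where the system currently fails, but contributes nothing to fine-tuning.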
Experimental rigor: The evaluation includes 9 baselines spanning 5 model families (clinical risk models, supervised ML, sequential DL, clinical BERT, LLM-based). Bootstrap confidence intervals over 1,000 resamples are reported. The inclusion of a challenging never-smoker subgroup (n=835) is clinically meaningful, as these patients are underserved by traditional smoking-centric risk models.
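For readers unfamiliar with the procedure, a percentile bootstrap interval for AUROC over 1,000 resamples can be sketched as below. The synthetic scores, the 10-case/100-control toy set, and the choice of the percentile method are assumptions; the paper may compute its intervals differently.

```python
import random

def auroc(labels, scores):
    # Probability that a random case scores higher than a random control
    # (ties count half). Returns None if a class is missing.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample patients with replacement, recompute
    # AUROC, and read the interval off the sorted resampled values.
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        value = auroc([labels[i] for i in idx], [scores[i] for i in idx])
        if value is not None:  # skip degenerate resamples with one class
            stats.append(value)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic toy data with a 1:10 case-control ratio (not the paper's data).
data_rng = random.Random(1)
labels = [1] * 10 + [0] * 100
scores = [data_rng.gauss(1.0, 1.0) if y else data_rng.gauss(0.0, 1.0) for y in labels]

point = auroc(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores)
```

With few positive cases, the resampled AUROC values spread widely, which is why small case counts lead to wide intervals.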
Concerns: The evaluation is single-institution and retrospective with a case-control design (1:10 matching), which inflates prevalence relative to real-world settings and may not reflect calibration under deployment conditions. The test sets are modest (n=1,000 overall, n=835 never-smokers), and with only 90 and 27 cases respectively, confidence intervals are wide (e.g., sensitivity intervals of ±0.086 and ±0.096 for the combined model). The AUROC improvements, while consistent, are incremental (0.814→0.860 overall; 0.775→0.835 never-smokers), and the overlapping confidence intervals in several metrics make it difficult to claim definitive superiority on any individual metric. The paper reports only a single epoch of MARL training and m=4 rollouts, and the sensitivity to these choices is not explored. Additionally, all results rely on a single backbone LLM (GPT-OSS-20B, an open-weight model), so it is unclear how well the gains transfer to other models.
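The prevalence concern can be made concrete with Bayes' rule. In the sketch below, the sensitivity and specificity (0.80 each) and the deployment prevalence (0.5%) are hypothetical values picked for illustration, not figures from the paper; only the 90-in-1,000 test-set prevalence comes from the text above.

```python
def ppv(sensitivity, specificity, prevalence):
    # Positive predictive value via Bayes' rule:
    # P(cancer | flagged) = TP rate / (TP rate + FP rate).
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical operating point (assumption, not reported in the paper).
SENS, SPEC = 0.80, 0.80

# Prevalence in the case-control test set: 90 cases out of 1,000 patients.
ppv_case_control = ppv(SENS, SPEC, prevalence=90 / 1000)

# Hypothetical deployment prevalence of 0.5% (assumption).
ppv_deployment = ppv(SENS, SPEC, prevalence=0.005)

print(f"PPV at 9% prevalence:   {ppv_case_control:.3f}")
print(f"PPV at 0.5% prevalence: {ppv_deployment:.3f}")
```

At a fixed sensitivity and specificity, the share of flagged patients who actually have cancer falls sharply as prevalence drops, so metrics measured on a matched case-control sample are likely to overstate precision in a screening population.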
3. Potential Impact
Clinical: Lung cancer early detection from routine EHRs could complement screening programs (LDCT), especially for populations not currently eligible for screening (e.g., never-smokers, who comprise ~15-20% of lung cancer cases). The system's robustness in the never-smoker subgroup is a notable practical strength.
Methodological: The combination of non-parametric memory evolution and parametric MARL in a multi-agent clinical reasoning system is novel. The finding that ExPool improves specificity while MARL improves sensitivity (Figure 6) provides actionable design principles for combining retrieval-augmented and fine-tuning-based approaches. The leave-one-out cross-retrieval strategy for bridging training and inference distributions is a clean solution to a real problem in retrieval-augmented training.
Broader AI: The evolutionary dynamics analysis (Sections 5.2-5.4) offers generalizable insights: the shift from diversity-optimal to specificity-optimal retrieval as the pool grows, and the asymmetric convergence of manager vs. worker agents, could inform self-evolving agent design beyond healthcare.
4. Timeliness & Relevance
The paper sits at the intersection of several active research fronts: LLM-based clinical reasoning, multi-agent systems, self-evolving agents, and RAG for healthcare. The limitations of static LLM systems for clinical prediction are widely recognized, and the "self-evolving" paradigm is gaining traction (as evidenced by the 2025 survey by Gao et al.). The application to longitudinal EHR modeling for cancer early detection is timely given the growing availability of large-scale EHR data and the push for AI-augmented clinical decision support.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations: The paper claims to be "the first self-evolving multi-agent framework for longitudinal EHR modelling applied to a real-world clinical prediction task," which appears accurate based on the literature review. The AUPRC values (0.32 overall, 0.28 never-smokers) reflect the inherent difficulty of the low-prevalence prediction task but remain practically limited for clinical deployment. The ablation showing complementary effects on sensitivity vs. specificity (Figure 6) is perhaps the most impactful analytical contribution, as it provides mechanistic understanding rather than just empirical gains.
Generated Jun 3, 2026
Comparison History (22)
Paper 1 presents a foundational step toward recursive self-improvement in AI through bilevel autoresearch. While Paper 2 offers a rigorous and highly valuable application in healthcare, Paper 1 introduces a paradigm shift in how AI systems might autonomously discover and optimize their own research methodologies. This meta-level capability has a significantly broader potential impact across all scientific domains by accelerating the pace of automated discovery itself.
Paper 1 addresses a broadly significant and timely societal issue—how routine AI interactions reshape human emotional connections—with large-scale longitudinal evidence from an OpenAI collaboration. Its findings have immediate policy implications affecting billions of AI users and span psychology, HCI, policy, and AI ethics. The 10.3% decrease in human support-seeking preference is a striking, widely communicable result. Paper 2, while technically strong and clinically valuable, addresses a narrower domain (lung cancer trajectory modeling) with incremental methodological advances in multi-agent LLM systems, limiting its cross-disciplinary reach.
Paper 2 has higher potential impact due to broader applicability and timeliness: a hybrid architecture that fuses task-specific small models into a transformer with selective activation and grounded action substrates can influence multimodal AI systems, efficiency, and deployment across many domains (OCR, GUI, speech, web, code). Its reported gains span numerous established benchmarks, suggesting strong engineering/methodological rigor and immediate real-world utility. Paper 1 is innovative for clinical EHR trajectory modeling with evolving memory + MARL, but its impact is narrower (lung cancer/EHR) and more domain-specific, limiting cross-field breadth.
Traj-Evolve presents a more rigorous and impactful contribution: it addresses a critical healthcare problem (lung cancer early detection), introduces a novel self-evolving multi-agent architecture combining non-parametric memory with MARL, demonstrates superiority over 9 strong baselines on real clinical data, and provides actionable insights about the complementary dynamics of its mechanisms. Paper 1, while creative in applying Navya-Nyaya logic, is limited by a tiny dataset (55 problems), a small model, and preliminary results where 100% accuracy on held-out data with only 40% format adherence raises concerns about evaluation rigor and generalizability.
Paper 1 offers a highly rigorous, methodologically innovative approach combining non-parametric memory and MARL to solve a specific, high-impact clinical problem (lung cancer early detection). It demonstrates strong empirical validation against 9 baselines using real-world longitudinal EHRs. Paper 2 introduces an interesting conceptual arena for benchmarking, but its abstract lacks the empirical depth, detailed methodology, and immediate real-world utility demonstrated in Paper 1.
Paper 1 has broader, cross-domain impact and timeliness: it targets a pervasive, under-instrumented failure mode of MILP-based decision engines (post-solve brittleness) and proposes a unifying evaluation/verification layer with solver-backed certification concepts that could influence optimization, operations research, ML-for-decision-making, and safety-critical deployment practices. While Paper 2 is application-relevant and likely impactful in clinical prediction, its methods (retrieval-augmented memory + multi-agent RL fine-tuning) are closer to fast-moving incremental advances in LLM systems and may face reproducibility/generalizability hurdles across institutions.
Paper 1 presents a novel, technically rigorous system (Traj-Evolve) combining non-parametric memory with multi-agent reinforcement learning for patient trajectory modeling—a methodologically innovative contribution with direct clinical application in lung cancer early detection. It introduces concrete mechanisms (ExPool, MARL, cross-retrieval strategy) validated against 9 baselines with detailed ablation analysis. Paper 2 is a scoping review synthesizing existing work rather than introducing new methods. While useful for the dental AI community, reviews generally have less scientific impact than novel methodological contributions, and Paper 1's innovations in self-evolving multi-agent systems have broader applicability beyond its specific clinical domain.
Paper 1 addresses a critical bottleneck in LLM inference—batching efficiency across different hardware architectures. By introducing a dynamically switching hybrid scheduler with closed-form conditions, it offers a fundamental infrastructure improvement. Such systems-level optimizations for LLMs typically see rapid, widespread adoption in serving frameworks, yielding massive broad-scale impact and high citation counts across the entire AI ecosystem. While Paper 2 presents an innovative healthcare application, Paper 1's foundational contribution to AI infrastructure gives it a broader and more immediate potential scientific and practical impact.
Paper 1 addresses a fundamental problem in causal inference—evaluating bivariate causal statements without ground truth—with broad applicability across many scientific fields. Its novel compatibility scoring framework, independence from the faithfulness assumption, and application to assessing LLM-generated causal claims make it highly relevant and timely. Paper 2, while technically strong and clinically relevant for lung cancer detection, is more narrowly focused on a specific application domain. Paper 1's methodological contribution to causal reasoning foundations gives it broader potential impact across disciplines.
Paper 1 presents a highly impactful real-world application (lung cancer early detection) using a novel self-evolving multi-agent system. By addressing the complexities of multimodal EHRs and combining non-parametric memory with MARL, it advances clinical AI reasoning. While Paper 2 offers a valuable tool for LLM alignment, Paper 1's direct potential to improve patient outcomes and its innovative methodological approach to medical time-series data give it a higher potential for broad societal and scientific impact.
Paper 2 has higher potential impact due to broader cross-field relevance and timeliness: coupling LLMs with physics-based/thermodynamic-kinetic simulators targets a central bottleneck in materials discovery (synthesis feasibility), with applications across inorganic systems and experimental planning. The hybrid evaluation framework is a generally extensible paradigm bridging AI and physical sciences. Paper 1 is innovative and potentially clinically useful, but is more domain-specific (lung cancer EHR trajectories) and its methodological claims (self-evolving memory + MARL) likely generalize less broadly than a physics-grounded LLM-simulation coupling approach.
Paper 2 addresses a critical, high-impact real-world problem (lung cancer early detection) using an innovative self-evolving multi-agent architecture that mirrors clinical reasoning. Its combination of non-parametric memory and multi-agent reinforcement learning for complex multimodal EHR data offers profound potential for direct clinical application and life-saving outcomes. This gives it a broader and more significant scientific and societal impact compared to Paper 1's more narrowly focused algorithmic improvements for LLM search agent training.
Paper 1 introduces a novel architectural design principle (score-level fusion) for hybrid language models that defines a new design axis beyond existing block-level and head-level paradigms. It addresses a fundamental challenge in language modeling with an elegant solution requiring no custom kernels, making it broadly applicable. Its impact spans the entire LLM community. Paper 2, while rigorous and clinically relevant, addresses a narrower domain (lung cancer early detection) with incremental advances combining existing techniques (multi-agent systems, RAG, MARL). Paper 1's architectural innovation has broader potential to influence future model designs across NLP.
ThoughtFold addresses a critical bottleneck—over-thinking and high token consumption—in Large Reasoning Models (LRMs). By significantly reducing token usage while maintaining accuracy, it offers foundational improvements applicable to any domain utilizing reasoning LLMs. While Paper 1 presents an innovative and valuable medical application, Paper 2's methodological advancement in fundamental AI efficiency grants it a much broader potential impact and higher timeliness across the entire artificial intelligence community.
Paper 1 addresses a fundamental and broadly applicable vulnerability in RLVR—buggy verifiers corrupting learned policies—which affects the entire rapidly growing field of LLM alignment via verifiable rewards (math, code, tool use). Its lightweight fuzzing framework is immediately actionable across many domains and highlights a systemic risk as RLVR scales. Paper 2, while technically strong and clinically relevant, targets a narrower application (lung cancer prediction from EHRs) with a complex multi-agent architecture whose generalizability is less clear. Paper 1's timeliness and breadth of impact across the RLVR ecosystem give it higher potential influence.
Paper 1 tackles a high-stakes clinical problem (lung cancer early detection) using a highly novel self-evolving multi-agent system combining non-parametric memory and MARL. Its methodology is highly innovative for LLM-based reasoning on longitudinal data. In contrast, Paper 2 provides more incremental architectural improvements (masking, TF-IDF) to graph transformers for database autocomplete tasks. Paper 1's combination of cutting-edge AI techniques with a profound real-world healthcare application gives it a substantially higher potential for scientific and societal impact.
Traj-Evolve addresses a concrete, high-impact clinical problem (lung cancer early detection) with a novel combination of experience-based memory, multi-agent reinforcement learning, and retrieval-augmented reasoning over longitudinal EHRs. Its methodological contributions—self-evolving agents, experience pools, and MARL-based optimization—are broadly applicable to other clinical trajectory modeling tasks. Paper 1 proposes a theoretical governance framework for agentic AI authorization, which is timely but more niche and infrastructural. Paper 2's empirical results on a real clinical task, novel architectural contributions, and direct potential to improve patient outcomes give it broader and more immediate scientific impact.
Paper 2 likely has higher impact: it targets a general RL mechanism failure in multimodal/visual reasoning, proposes a principled fix (vision-anchored token selection) and validates it across model scales with controlled studies and ablations. The contribution is broadly applicable to multimodal RL/LLMs and timely given rapid growth of vision-language RL. Paper 1 is innovative and clinically relevant, but its scope is narrower (lung cancer EHR trajectories) and real-world deployment faces domain/data/regulatory barriers, limiting near-term breadth despite strong application value.
Paper 1 likely has higher scientific impact due to direct clinical relevance (lung cancer early detection) and clear real-world applicability in healthcare, where gains in sensitivity/specificity can translate to patient outcomes. Its combination of nonparametric experience memory with MARL under a unified retrieval strategy is a technically novel, timely approach for long-context, sparse EHR modeling, and could generalize to other longitudinal clinical prediction tasks. Paper 2 introduces an interesting human-centric UGC evaluation paradigm and benchmark, but its societal/industrial applications are less high-stakes and its constructs (personas, “community mind”) may be harder to validate rigorously.
Traj-Evolve presents a more novel and broadly impactful contribution: a self-evolving multi-agent system combining non-parametric memory with multi-agent reinforcement learning for clinical trajectory modeling. It addresses a critical healthcare problem (lung cancer early detection), demonstrates methodological innovation (experience pool, MARL, cross-retrieval strategy), and has clear real-world clinical applications. Paper 1, while technically sound, addresses a narrower CS education problem (automated grading) with incremental improvements to existing transformer fine-tuning approaches, limiting its broader scientific impact.