Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

Sihang Zeng, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

Jun 1, 2026

arXiv:2606.02812v1

cs.AI (primary), cs.CL

#1968 of 3355 · Artificial Intelligence

Tournament Score: 1384 ± 46

Win Rate: 50%

Rating: 6.8 / 10

Significance 7 · Rigor 6.5 · Novelty 7.5 · Clarity 7.5

Abstract

Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms 9 strong baselines on the overall population and a challenging never-smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent's prediction loss converges quickly while the worker agents' temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Traj-Evolve

1. Core Contribution

Traj-Evolve introduces a self-evolving multi-agent framework for longitudinal EHR modeling in lung cancer early detection. The key insight is that existing LLM-based multi-agent systems process patients in isolation—unlike clinicians who accumulate diagnostic experience from prior cases. The paper addresses this through two complementary mechanisms: (1) an Experience Pool (ExPool), a non-parametric vector database that indexes rejection-sampled reasoning traces for retrieval-augmented "patients-like-me" few-shot learning; and (2) Multi-Agent Reinforcement Learning (MARL) via reward-ranked fine-tuning that parametrically updates both worker and manager agents. A leave-one-out cross-retrieval strategy bridges the two mechanisms, ensuring training and inference distributions remain aligned.
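The leave-one-out cross-retrieval idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding vectors, the cosine similarity measure, and the names `cosine` and `retrieve` are all assumptions made for illustration. The point is only that, during training, a patient who is already indexed in the pool is excluded from their own retrieval results, so the model sees neighbours rather than its own stored trace, as it would at inference time for an unseen patient.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(pool: Dict[str, List[float]], query_id: str,
             query_vec: Sequence[float], k: int = 2,
             leave_one_out: bool = True) -> List[str]:
    """Return the ids of the k pool entries most similar to the query.

    With leave_one_out=True the query patient's own entry is skipped
    (hypothetical training-time behaviour).
    """
    scored = [
        (cosine(query_vec, vec), pid)
        for pid, vec in pool.items()
        if not (leave_one_out and pid == query_id)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[:k]]


# Toy pool of hypothetical patient embeddings.
pool = {"p1": [1.0, 0.0], "p2": [0.9, 0.1], "p3": [0.0, 1.0]}

# Training-time retrieval for p1 never returns p1 itself.
train_neighbors = retrieve(pool, "p1", pool["p1"], k=2)

# Without the exclusion, p1 would trivially retrieve its own trace first.
naive_neighbors = retrieve(pool, "p1", pool["p1"], k=2, leave_one_out=False)
```

Without the exclusion, the top hit during training is always the patient's own stored trace, which never happens for a new patient at inference time.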

The conceptual framing—transforming static patient-by-patient prediction into a continuously improving clinical learning system—is compelling and clinically motivated. The analogy to clinical expertise accumulation is sound, and the technical realization is well-designed.

2. Methodological Rigor

Strengths in design: The rejection sampling strategy for ExPool construction (Eqs. 3-4) is thoughtful: retaining the "best" trace per patient regardless of correctness captures the system's decision boundary. The hard rejection filter for MARL (Eqs. 9-10) ensures only clinically consistent reasoning enters the training pipeline. The decoupled optimization of worker and manager agents (Eqs. 14-15) is justified by their distinct roles and validated by the asymmetric learning curves observed in Figure 5.
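The two selection rules mentioned above, best-of-m retention for the pool and a hard filter for fine-tuning data, can be sketched as follows. The paper's actual reward (its Eqs. 3-4 and 9-10) is not reproduced in this assessment, so the `reward` function here is an assumed stand-in (the probability assigned to the true label), and `Trace`, `best_per_patient`, `hard_filter`, and the 0.5 threshold are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Trace:
    patient_id: str
    text: str      # reasoning trace produced by one rollout
    risk: float    # predicted probability of cancer in [0, 1]
    label: int     # ground truth: 1 = case, 0 = control


def reward(t: Trace) -> float:
    """Assumed reward: probability mass placed on the true label."""
    return t.risk if t.label == 1 else 1.0 - t.risk


def best_per_patient(traces: Iterable[Trace]) -> Dict[str, Trace]:
    """Keep the highest-reward rollout per patient, even if it is wrong."""
    best: Dict[str, Trace] = {}
    for t in traces:
        current = best.get(t.patient_id)
        if current is None or reward(t) > reward(current):
            best[t.patient_id] = t
    return best


def hard_filter(best: Dict[str, Trace], threshold: float = 0.5) -> List[Trace]:
    """Keep only traces whose prediction agrees with the label."""
    return [t for t in best.values() if reward(t) > threshold]


rollouts = [
    Trace("a", "trace a1", 0.2, 1),
    Trace("a", "trace a2", 0.7, 1),
    Trace("b", "trace b1", 0.8, 0),
    Trace("b", "trace b2", 0.6, 0),
]
pool_entries = best_per_patient(rollouts)   # one entry per patient
finetune_set = hard_filter(pool_entries)    # only correct predictions
```

In this toy run patient "b" still contributes a pool entry (its least-wrong rollout) but contributes nothing to the fine-tuning set, which mirrors the distinction the assessment draws between the two mechanisms.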

Experimental rigor: The evaluation includes 9 baselines spanning 5 model families (clinical risk models, supervised ML, sequential DL, clinical BERT, LLM-based). Bootstrap confidence intervals over 1,000 resamples are reported. The inclusion of a challenging never-smoker subgroup (n=835) is clinically meaningful, as these patients are underserved by traditional smoking-centric risk models.
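For readers unfamiliar with the bootstrap procedure mentioned above, here is a minimal percentile-bootstrap sketch for an AUROC confidence interval. It uses synthetic scores, not the paper's data, and the function names `auroc` and `bootstrap_ci` are illustrative. Whether the paper uses percentile intervals or another bootstrap variant is not stated here, so that choice is an assumption.

```python
import random
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability a random case outscores a random control."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels: Sequence[int], scores: Sequence[float],
                 n_resamples: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for AUROC."""
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both classes
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic example: 10 cases and 100 controls (a 1:10 ratio).
rng = random.Random(42)
labels = [1] * 10 + [0] * 100
scores = [rng.gauss(1.0, 1.0) for _ in range(10)] + \
         [rng.gauss(0.0, 1.0) for _ in range(100)]

point = auroc(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores)
```

With few positive cases, each resample contains a different handful of them, which is why intervals widen as the case count drops.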

Concerns: The evaluation is single-institution and retrospective with a case-control design (1:10 matching), which inflates prevalence relative to real-world settings and may not reflect calibration under deployment conditions. The test sets are modest (n=1,000 overall, n=835 never-smokers), and with only 90 and 27 cases respectively, confidence intervals are wide (e.g., sensitivity ranges of ±0.086 and ±0.096 for the combined model). The AUROC improvements, while consistent, are incremental (0.814→0.860 overall; 0.775→0.835 never-smokers) and the overlapping confidence intervals in several metrics make it difficult to claim definitive superiority in individual metrics. The paper reports only a single epoch of MARL training and m=4 rollouts—the sensitivity to these choices is not explored. Additionally, the reliance on GPT-OSS-20B (a proprietary model) limits reproducibility.
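The prevalence concern can be made concrete with Bayes' rule. The sketch below computes positive predictive value (PPV) at the test-set prevalences implied by the counts above (90 of 1,000 and 27 of 835) and at a much lower prevalence. The sensitivity and specificity of 0.80 and the 0.5% deployment prevalence are hypothetical values chosen for illustration, not figures from the paper.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Prevalences implied by the reported test-set counts.
prev_overall = 90 / 1000       # 0.09
prev_never_smoker = 27 / 835   # about 0.032

# Hypothetical operating point and deployment prevalence (assumptions).
SENS, SPEC = 0.80, 0.80
prev_deploy = 0.005

ppv_overall = ppv(SENS, SPEC, prev_overall)
ppv_never_smoker = ppv(SENS, SPEC, prev_never_smoker)
ppv_deploy = ppv(SENS, SPEC, prev_deploy)
```

At a fixed operating point, PPV falls as prevalence falls, so metrics measured on a prevalence-enriched case-control sample overstate the share of flagged patients who would actually have cancer in a lower-prevalence population.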

3. Potential Impact

Clinical: Lung cancer early detection from routine EHRs could complement screening programs (LDCT), especially for populations not currently eligible for screening (e.g., never-smokers, who comprise ~15-20% of lung cancer cases). The system's robustness in the never-smoker subgroup is a notable practical strength.

Methodological: The combination of non-parametric memory evolution and parametric MARL in a multi-agent clinical reasoning system is novel. The finding that ExPool improves specificity while MARL improves sensitivity (Figure 6) provides actionable design principles for combining retrieval-augmented and fine-tuning-based approaches. The leave-one-out cross-retrieval strategy for bridging training and inference distributions is a clean solution to a real problem in retrieval-augmented training.

Broader AI: The evolutionary dynamics analysis (Sections 5.2-5.4) offers generalizable insights: the shift from diversity-optimal to specificity-optimal retrieval as the pool grows, and the asymmetric convergence of manager vs. worker agents, could inform self-evolving agent design beyond healthcare.
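The diversity-versus-specificity trade-off in retrieval can be illustrated with maximal marginal relevance (MMR), a standard re-ranking heuristic. This is a swapped-in technique for illustration; the assessment does not say how Traj-Evolve controls retrieval diversity. The vectors, the `mmr` function, and the `lam` weights are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query: Sequence[float], pool: Dict[str, List[float]],
        k: int, lam: float) -> List[str]:
    """Greedy MMR selection.

    lam = 1.0 ranks purely by similarity to the query (specific);
    smaller lam penalises similarity to already-selected items (diverse).
    """
    selected: List[str] = []
    remaining = dict(pool)
    while remaining and len(selected) < k:
        def score(pid: str) -> float:
            relevance = cosine(query, remaining[pid])
            redundancy = max(
                (cosine(remaining[pid], pool[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


query = [1.0, 0.0]
pool = {"a": [1.0, 0.05], "b": [1.0, 0.1], "c": [0.6, 0.8]}

specific = mmr(query, pool, k=2, lam=1.0)   # two nearest neighbours
diverse = mmr(query, pool, k=2, lam=0.3)    # nearest plus a dissimilar one
```

In this toy pool the specific setting returns the two near-duplicates "a" and "b", while the diverse setting swaps "b" for the dissimilar "c".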

4. Timeliness & Relevance

The paper sits at the intersection of several active research fronts: LLM-based clinical reasoning, multi-agent systems, self-evolving agents, and RAG for healthcare. The limitations of static LLM systems for clinical prediction are widely recognized, and the "self-evolving" paradigm is gaining traction (as evidenced by the 2025 survey by Gao et al.). The application to longitudinal EHR modeling for cancer early detection is timely given the growing availability of large-scale EHR data and the push for AI-augmented clinical decision support.

5. Strengths & Limitations

Key Strengths:

Clinically grounded motivation with a clear analogy to clinician experience accumulation

Thoughtful integration of two complementary self-evolving mechanisms with principled unification

Comprehensive baseline comparison across model families

Insightful analysis of evolutionary dynamics (Figures 4-6) that goes beyond just reporting metrics

Robustness in the clinically important never-smoker subgroup

The case study (Appendix C) convincingly illustrates how ExPool enables comparative clinical reasoning

Notable Limitations:

Single-institution, retrospective case-control design with modest case counts

No prospective or external validation

Reliance on proprietary LLM (GPT-OSS-20B) limits reproducibility

The paper acknowledges but does not address the latency of ground-truth labels in real deployment

Computational cost analysis is absent—running 4 rollouts per patient through a multi-agent chain-of-agents system is expensive

Process reward signals (mentioned in Limitations) could strengthen MARL but remain unexplored

The LLM-as-a-judge evaluation (Figure 3) using another GPT model introduces potential bias

Additional Observations: The paper claims to be "the first self-evolving multi-agent framework for longitudinal EHR modelling applied to a real-world clinical prediction task," which appears accurate based on the literature review. The AUPRC values (0.32 overall, 0.28 never-smokers) reflect the inherent difficulty of the low-prevalence prediction task but remain practically limited for clinical deployment. The ablation showing complementary effects on sensitivity vs. specificity (Figure 6) is perhaps the most impactful analytical contribution, as it provides mechanistic understanding rather than just empirical gains.

Rating: 6.8 / 10

Significance 7 · Rigor 6.5 · Novelty 7.5 · Clarity 7.5

Generated Jun 3, 2026

Comparison History (22)

vs. Bilevel Autoresearch: Meta-Autoresearching Itself

gemini-3.1 · 6/5/2026

Paper 1 presents a foundational step toward recursive self-improvement in AI through bilevel autoresearch. While Paper 2 offers a rigorous and highly valuable application in healthcare, Paper 1 introduces a paradigm shift in how AI systems might autonomously discover and optimize their own research methodologies. This meta-level capability has a significantly broader potential impact across all scientific domains by accelerating the pace of automated discovery itself.

vs. Stumbling Into AI Emotional Dependence: How Routine AI Interactions Reshape Human Connection

claude-opus-4.6 · 6/5/2026

Paper 1 addresses a broadly significant and timely societal issue—how routine AI interactions reshape human emotional connections—with large-scale longitudinal evidence from an OpenAI collaboration. Its findings have immediate policy implications affecting billions of AI users and span psychology, HCI, policy, and AI ethics. The 10.3% decrease in human support-seeking preference is a striking, widely communicable result. Paper 2, while technically strong and clinically valuable, addresses a narrower domain (lung cancer trajectory modeling) with incremental methodological advances in multi-agent LLM systems, limiting its cross-disciplinary reach.

vs. Interfaze: The Future of AI is built on Task-Specific Small Models

gpt-5.2 · 6/5/2026

Paper 2 has higher potential impact due to broader applicability and timeliness: a hybrid architecture that fuses task-specific small models into a transformer with selective activation and grounded action substrates can influence multimodal AI systems, efficiency, and deployment across many domains (OCR, GUI, speech, web, code). Its reported gains span numerous established benchmarks, suggesting strong engineering/methodological rigor and immediate real-world utility. Paper 1 is innovative for clinical EHR trajectory modeling with evolving memory + MARL, but its impact is narrower (lung cancer/EHR) and more domain-specific, limiting cross-field breadth.

vs. Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya

claude-opus-4.6 · 6/5/2026

Traj-Evolve presents a more rigorous and impactful contribution: it addresses a critical healthcare problem (lung cancer early detection), introduces a novel self-evolving multi-agent architecture combining non-parametric memory with MARL, demonstrates superiority over 9 strong baselines on real clinical data, and provides actionable insights about the complementary dynamics of its mechanisms. Paper 1, while creative in applying Navya-Nyaya logic, is limited by a tiny dataset (55 problems), a small model, and preliminary results where 100% accuracy on held-out data with only 40% format adherence raises concerns about evaluation rigor and generalizability.

vs. OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

gemini-3.1 · 6/5/2026

Paper 1 offers a highly rigorous, methodologically innovative approach combining non-parametric memory and MARL to solve a specific, high-impact clinical problem (lung cancer early detection). It demonstrates strong empirical validation against 9 baselines using real-world longitudinal EHRs. Paper 2 introduces an interesting conceptual arena for benchmarking, but its abstract lacks the empirical depth, detailed methodology, and immediate real-world utility demonstrated in Paper 1.

vs. Position Paper: Post-Solve Robustness in Decision Engines: Feasible Regions and Smoothness Under Perturbations

gpt-5.2 · 6/5/2026

Paper 1 has broader, cross-domain impact and timeliness: it targets a pervasive, under-instrumented failure mode of MILP-based decision engines (post-solve brittleness) and proposes a unifying evaluation/verification layer with solver-backed certification concepts that could influence optimization, operations research, ML-for-decision-making, and safety-critical deployment practices. While Paper 2 is application-relevant and likely impactful in clinical prediction, its methods (retrieval-augmented memory + multi-agent RL fine-tuning) are closer to fast-moving incremental advances in LLM systems and may face reproducibility/generalizability hurdles across institutions.

vs. Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

claude-opus-4.6 · 6/3/2026

Paper 1 presents a novel, technically rigorous system (Traj-Evolve) combining non-parametric memory with multi-agent reinforcement learning for patient trajectory modeling—a methodologically innovative contribution with direct clinical application in lung cancer early detection. It introduces concrete mechanisms (ExPool, MARL, cross-retrieval strategy) validated against 9 baselines with detailed ablation analysis. Paper 2 is a scoping review synthesizing existing work rather than introducing new methods. While useful for the dental AI community, reviews generally have less scientific impact than novel methodological contributions, and Paper 1's innovations in self-evolving multi-agent systems have broader applicability beyond its specific clinical domain.

vs. Threshold-Based Exclusive Batching for LLM Inference

gemini-3.1 · 6/3/2026

Paper 1 addresses a critical bottleneck in LLM inference—batching efficiency across different hardware architectures. By introducing a dynamically switching hybrid scheduler with closed-form conditions, it offers a fundamental infrastructure improvement. Such systems-level optimizations for LLMs typically see rapid, widespread adoption in serving frameworks, yielding massive broad-scale impact and high citation counts across the entire AI ecosystem. While Paper 2 presents an innovative healthcare application, Paper 1's foundational contribution to AI infrastructure gives it a broader and more immediate potential scientific and practical impact.

vs. Evaluating Bivariate Causal Statements Based on Mutual Compatibility

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a fundamental problem in causal inference—evaluating bivariate causal statements without ground truth—with broad applicability across many scientific fields. Its novel compatibility scoring framework, independence from the faithfulness assumption, and application to assessing LLM-generated causal claims make it highly relevant and timely. Paper 2, while technically strong and clinically relevant for lung cancer detection, is more narrowly focused on a specific application domain. Paper 1's methodological contribution to causal reasoning foundations gives it broader potential impact across disciplines.

vs. ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment

gemini-3.1 · 6/3/2026

Paper 1 presents a highly impactful real-world application (lung cancer early detection) using a novel self-evolving multi-agent system. By addressing the complexities of multimodal EHRs and combining non-parametric memory with MARL, it advances clinical AI reasoning. While Paper 2 offers a valuable tool for LLM alignment, Paper 1's direct potential to improve patient outcomes and its innovative methodological approach to medical time-series data give it a higher potential for broad societal and scientific impact.

vs. Coupling Language Models with Physics-based Simulation for Synthesis of Inorganic Materials

gpt-5.2 · 6/3/2026

Paper 2 has higher potential impact due to broader cross-field relevance and timeliness: coupling LLMs with physics-based/thermodynamic-kinetic simulators targets a central bottleneck in materials discovery (synthesis feasibility), with applications across inorganic systems and experimental planning. The hybrid evaluation framework is a generally extensible paradigm bridging AI and physical sciences. Paper 1 is innovative and potentially clinically useful, but is more domain-specific (lung cancer EHR trajectories) and its methodological claims (self-evolving memory + MARL) likely generalize less broadly than a physics-grounded LLM-simulation coupling approach.

vs. CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

gemini-3.1 · 6/3/2026

Paper 2 addresses a critical, high-impact real-world problem (lung cancer early detection) using an innovative self-evolving multi-agent architecture that mirrors clinical reasoning. Its combination of non-parametric memory and multi-agent reinforcement learning for complex multimodal EHR data offers profound potential for direct clinical application and life-saving outcomes. This gives it a broader and more significant scientific and societal impact compared to Paper 1's more narrowly focused algorithmic improvements for LLM search agent training.

vs. Forget Attention: Importance-Aware Attention Is All You Need

claude-opus-4.6 · 6/3/2026

Paper 1 introduces a novel architectural design principle (score-level fusion) for hybrid language models that defines a new design axis beyond existing block-level and head-level paradigms. It addresses a fundamental challenge in language modeling with an elegant solution requiring no custom kernels, making it broadly applicable. Its impact spans the entire LLM community. Paper 2, while rigorous and clinically relevant, addresses a narrower domain (lung cancer early detection) with incremental advances combining existing techniques (multi-agent systems, RAG, MARL). Paper 1's architectural innovation has broader potential to influence future model designs across NLP.

vs. ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

gemini-3.1 · 6/3/2026

ThoughtFold addresses a critical bottleneck—over-thinking and high token consumption—in Large Reasoning Models (LRMs). By significantly reducing token usage while maintaining accuracy, it offers foundational improvements applicable to any domain utilizing reasoning LLMs. While Paper 1 presents an innovative and valuable medical application, Paper 2's methodological advancement in fundamental AI efficiency grants it a much broader potential impact and higher timeliness across the entire artificial intelligence community.

vs. Before the Model Learns the Bug: Fuzzing RLVR Verifiers

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a fundamental and broadly applicable vulnerability in RLVR—buggy verifiers corrupting learned policies—which affects the entire rapidly growing field of LLM alignment via verifiable rewards (math, code, tool use). Its lightweight fuzzing framework is immediately actionable across many domains and highlights a systemic risk as RLVR scales. Paper 2, while technically strong and clinically relevant, targets a narrower application (lung cancer prediction from EHRs) with a complex multi-agent architecture whose generalizability is less clear. Paper 1's timeliness and breadth of impact across the RLVR ecosystem give it higher potential influence.

vs. RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases

gemini-3.1 · 6/3/2026

Paper 1 tackles a high-stakes clinical problem (lung cancer early detection) using a highly novel self-evolving multi-agent system combining non-parametric memory and MARL. Its methodology is highly innovative for LLM-based reasoning on longitudinal data. In contrast, Paper 2 provides more incremental architectural improvements (masking, TF-IDF) to graph transformers for database autocomplete tasks. Paper 1's combination of cutting-edge AI techniques with a profound real-world healthcare application gives it a substantially higher potential for scientific and societal impact.

vs. Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI

claude-opus-4.6 · 6/3/2026

Traj-Evolve addresses a concrete, high-impact clinical problem (lung cancer early detection) with a novel combination of experience-based memory, multi-agent reinforcement learning, and retrieval-augmented reasoning over longitudinal EHRs. Its methodological contributions—self-evolving agents, experience pools, and MARL-based optimization—are broadly applicable to other clinical trajectory modeling tasks. Paper 1 proposes a theoretical governance framework for agentic AI authorization, which is timely but more niche and infrastructural. Paper 2's empirical results on a real clinical task, novel architectural contributions, and direct potential to improve patient outcomes give it broader and more immediate scientific impact.

vs. Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact: it targets a general RL mechanism failure in multimodal/visual reasoning, proposes a principled fix (vision-anchored token selection) and validates it across model scales with controlled studies and ablations. The contribution is broadly applicable to multimodal RL/LLMs and timely given rapid growth of vision-language RL. Paper 1 is innovative and clinically relevant, but its scope is narrower (lung cancer EHR trajectories) and real-world deployment faces domain/data/regulatory barriers, limiting near-term breadth despite strong application value.

vs. Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

gpt-5.2 · 6/3/2026

Paper 1 likely has higher scientific impact due to direct clinical relevance (lung cancer early detection) and clear real-world applicability in healthcare, where gains in sensitivity/specificity can translate to patient outcomes. Its combination of nonparametric experience memory with MARL under a unified retrieval strategy is a technically novel, timely approach for long-context, sparse EHR modeling, and could generalize to other longitudinal clinical prediction tasks. Paper 2 introduces an interesting human-centric UGC evaluation paradigm and benchmark, but its societal/industrial applications are less high-stakes and its constructs (personas, “community mind”) may be harder to validate rigorously.

vs. Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria

claude-opus-4.6 · 6/3/2026

Traj-Evolve presents a more novel and broadly impactful contribution: a self-evolving multi-agent system combining non-parametric memory with multi-agent reinforcement learning for clinical trajectory modeling. It addresses a critical healthcare problem (lung cancer early detection), demonstrates methodological innovation (experience pool, MARL, cross-retrieval strategy), and has clear real-world clinical applications. Paper 1, while technically sound, addresses a narrower CS education problem (automated grading) with incremental improvements to existing transformer fine-tuning approaches, limiting its broader scientific impact.