Continual Model Routing in Evolving Model Hubs
Jack Bell, Giacomo Carfì, Gerlando Gramaglia, Vincenzo Lomonaco
Abstract
AI model hubs provide access to a rapidly growing collection of powerful pre-trained models, enabling off-the-shelf mixture-of-experts systems with different routing strategies. However, this rapid growth poses two fundamental challenges: scaling model selection across thousands of experts and continually updating routing mechanisms as new models and tasks are introduced. In this paper, we formalise this setting as Continual Model Routing (CMR) and propose CMRBench, a new large-scale benchmark simulating realistic hub expansion and including over 2,000 candidate models. Finally, we introduce CARvE, a contrastive embedding approach for efficient continual model routing via checkpoint-based anchoring and structured replay. Extensive empirical results and ablations show that CARvE significantly outperforms zero-shot retrieval, fine-tuning, and adapter-merging baselines in model, family, and domain-level accuracy.
AI Impact Assessments
Scientific Impact Assessment: Continual Model Routing in Evolving Model Hubs
1. Core Contribution
This paper makes three intertwined contributions: (1) the formalization of Continual Model Routing (CMR) as a class-incremental learning problem where the label space (model IDs) expands over time; (2) CMRBench, a four-experience benchmark spanning 2,000+ candidate models built from APIBench, ToolMMBench, and a new HuggingBench dataset; and (3) CARvE, a contrastive embedding-based router that uses checkpoint-based anchoring and structured replay to continually adapt to an expanding model registry.
The key insight is that model routing in growing hubs should not be treated as a static retrieval or one-time classification problem, but rather as a continual learning challenge where new models arrive, the label space shifts, and the router must adapt without catastrophic forgetting. This reframing is well-motivated: model hubs like Hugging Face host millions of models, and any practical routing system must handle non-stationarity.
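To make the continual-learning framing concrete, the following is a minimal sketch of a contrastive routing objective with a checkpoint anchor on the projection layer. It is an illustration under stated assumptions, not CARvE's actual implementation: the assessment only says that CARvE combines contrastive embeddings with checkpoint-based anchoring, so the InfoNCE-style loss, the squared-drift penalty, and every name here (`routing_loss`, `W`, `W_ckpt`, `lam`, `temperature`) are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Fall back to 1.0 so a zero vector does not cause a division by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def matvec(W, x):
    return [dot(row, x) for row in W]


def info_nce(query, model_embs, target, temperature=0.1):
    """Cross-entropy over cosine similarities between one prompt and all models."""
    q = normalize(query)
    logits = [dot(q, normalize(m)) / temperature for m in model_embs]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_z - logits[target]


def routing_loss(prompt_feat, W, W_ckpt, model_embs, target,
                 lam=1.0, temperature=0.1):
    """Contrastive loss plus a penalty on drift from the previous checkpoint.

    W is the current projection (assumed trainable); W_ckpt is the frozen
    projection saved after the previous experience (an assumption of this sketch).
    """
    q = matvec(W, prompt_feat)
    contrastive = info_nce(q, model_embs, target, temperature)
    n_params = sum(len(row) for row in W)
    drift = sum((a - b) ** 2
                for row, row_ckpt in zip(W, W_ckpt)
                for a, b in zip(row, row_ckpt)) / n_params
    return contrastive + lam * drift
```

When a new experience adds models, `model_embs` simply gains rows, which is how the expanding label space shows up in this sketch; the anchor term discourages the projection from moving far from the checkpoint that served the earlier models.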
2. Methodological Rigor
The experimental design is thorough. The authors compare CARvE against a comprehensive set of baselines: retrieval-only (BM25, SentenceTransformers, SPLADE, BGE-M3), sequential fine-tuning, random replay at multiple budgets, model merging (TIES, DARE, averaging), regularization methods (EWC, LwF), joint training upper bounds, cumulative training, and HuggingGPT-style controllers. Three random seeds are used throughout, with standard errors reported.
The ablation study is well-structured, isolating contributions of: replay strategy (random vs. domain-stratified), anchoring mechanisms (embedding vs. projection), candidate set size K, backbone sensitivity (LLaMA2-7B, Qwen2.5-7B, Qwen3-4B), and label noise robustness. The finding that projection anchoring is more critical than embedding anchoring provides actionable architectural insight.
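The difference between the two replay strategies in the ablation can be sketched as follows. This is a hedged illustration of random versus domain-stratified sampling from a replay buffer, assuming each stored example carries a `domain` field; the function names, the round-robin allocation, and the example schema are assumptions of this sketch, and the paper's structured replay may allocate the budget differently.

```python
import random
from collections import defaultdict


def random_replay(buffer, budget, rng):
    """Uniform sample: large domains dominate in proportion to their size."""
    return rng.sample(buffer, min(budget, len(buffer)))


def stratified_replay(buffer, budget, rng):
    """Round-robin over domains so each gets a near-equal share of the budget."""
    by_domain = defaultdict(list)
    for example in buffer:
        by_domain[example["domain"]].append(example)
    for items in by_domain.values():
        rng.shuffle(items)

    selected = []
    domains = sorted(by_domain)
    # Keep cycling until the budget is spent or every domain is exhausted.
    while len(selected) < budget and any(by_domain[d] for d in domains):
        for d in domains:
            if by_domain[d] and len(selected) < budget:
                selected.append(by_domain[d].pop())
    return selected
```

With a buffer that is 90% one domain, uniform sampling at a small budget will often leave the minority domain with one example or none, whereas the stratified variant guarantees it a share as long as it has examples left.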
However, several methodological concerns warrant discussion:
3. Potential Impact
Practical relevance: As model hubs continue to grow, automated routing becomes increasingly important. CARvE's embedding-based approach, with O(Kd) per-example scoring over a candidate set of size K and potential FAISS integration, makes it deployable in latency-sensitive settings. The compute analysis, which shows a 45-48% reduction relative to cumulative or from-scratch retraining, strengthens the practical case.
Community infrastructure: CMRBench fills a genuine gap—prior routing benchmarks assumed static candidate pools. The temporal structuring across four experiences with realistic model overlap provides a reproducible evaluation substrate for future work.
Broader influence: The paper connects model routing to continual learning in a principled way, which could influence how the community thinks about maintaining AI infrastructure systems more generally. The connection to MoE architectures and system-level orchestration is well-drawn.
Limitations on impact: The approach requires supervised prompt-model pairs for each new model, creating a cold-start problem that the authors acknowledge but do not solve. This significantly limits deployment in truly open-ended hub settings where new models arrive continuously without curated routing examples.
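The O(Kd) scoring cost cited under practical relevance can be illustrated with a brute-force scorer: one dot product of length d per candidate, for K candidates. This is a minimal sketch assuming d denotes the embedding dimension and that routing picks the highest-scoring model; `score_candidates`, `route`, and the model IDs are hypothetical, and an approximate-nearest-neighbour index such as FAISS would replace the exhaustive loop at hub scale.

```python
import heapq


def score_candidates(query, candidates):
    """Dot-product score per candidate: K candidates x d dimensions = O(Kd)."""
    return [sum(q * c for q, c in zip(query, emb)) for emb in candidates]


def route(query, candidates, model_ids, top_n=1):
    """Return the top_n (model_id, score) pairs, best first."""
    scores = score_candidates(query, candidates)
    best = heapq.nlargest(top_n, range(len(scores)), key=scores.__getitem__)
    return [(model_ids[i], scores[i]) for i in best]


# Usage with toy 2-dimensional embeddings and made-up model IDs.
ranked = route(
    query=[1.0, 0.0],
    candidates=[[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],
    model_ids=["model-a", "model-b", "model-c"],
    top_n=2,
)
```

Here `ranked` lists `model-a` first and `model-c` second, since their embeddings have the largest components along the query direction.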
4. Timeliness & Relevance
This work addresses a genuine and growing bottleneck. The proliferation of specialized models and the shift toward "scaling by specialization" make routing a first-class problem. The timing is appropriate—model hubs are at a scale where manual selection is infeasible, but automated routing infrastructure is underdeveloped.
The continual learning framing is particularly timely given recent community attention to maintaining foundation model systems over time rather than retraining from scratch.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
This paper makes a meaningful contribution by formalizing continual model routing, providing a benchmark, and demonstrating that continual learning techniques significantly improve routing stability. The experimental work is extensive and largely convincing. However, the cold-start limitation, reliance on synthetic data, and relatively modest absolute accuracy levels temper the practical significance. The work is best understood as establishing a research direction and evaluation framework rather than providing a deployment-ready solution.
Generated May 28, 2026
Comparison History (16)
Paper 2 likely has higher scientific impact due to its timely, security-critical framing (persistent, stateful “Sleeper Attacks” on LLM agents), broad relevance across agent frameworks, safety, and deployment contexts, and clear real-world implications for tool-using systems. It introduces a novel threat model beyond single-turn jailbreaks and provides a sizable benchmark spanning outcomes, strategies, and state targets, with evidence across multiple open/closed models. Paper 1 is valuable for scalable model routing and benchmarking, but its impact is narrower and more systems/ML-infra focused.
Paper 1 addresses a highly timely and broad challenge in modern AI: efficiently routing across rapidly expanding hubs of pre-trained models. By formalizing Continual Model Routing and introducing a large-scale benchmark (CMRBench), it provides foundational tools for the growing ecosystem of LLMs and MoE systems. In contrast, Paper 2 tackles linguistic steganalysis, which, while valuable for cybersecurity, represents a more niche application. Paper 1's broader relevance across various AI domains and its alignment with current trends in model scaling give it a significantly higher potential for widespread scientific impact.
Paper 2 likely has higher scientific impact: it formalizes a broadly relevant problem (continual routing over expanding model hubs), contributes a large-scale benchmark (CMRBench with 2,000+ models) that can anchor community progress, and proposes a general routing method (CARvE) with empirical comparisons/ablations. Its applicability spans many tasks and model ecosystems, aligning with a timely shift toward model hubs and MoE-style selection. Paper 1 is highly practical and innovative for JS-native analytics with LLM UDFs, but its impact is more domain-specific (client-side data/agent tooling) and benchmark generality may be narrower.
Paper 1 has higher likely scientific impact due to strong methodological novelty (formalizing Continual Model Routing, introducing a large-scale benchmark with >2,000 models, and proposing a new contrastive routing method) and broad applicability across ML systems that increasingly rely on model hubs and mixture-of-experts. Its contributions are reusable infrastructure (CMRBench) plus an algorithm (CARvE), enabling follow-on work and adoption in many domains. Paper 2 is timely and practically relevant, but its impact may be narrower to education/management contexts and depends on external validity beyond the experimental setting.
Paper 2 addresses a concrete, timely problem (scaling model selection in growing model hubs) with a well-defined benchmark (CMRBench with 2,000+ models) and a novel method (CARvE) backed by extensive empirical validation. It has immediate practical applicability as model hubs like Hugging Face continue to grow explosively. Paper 1, while intellectually interesting in formalizing managed autonomy for agentic AI, is primarily theoretical with limited empirical validation. Paper 2's benchmark contribution alone provides lasting infrastructure for the community, and its combination of formalization, benchmark, and method gives it broader and more immediate impact.
Paper 2 addresses a fundamental question about LLM cognition—whether they build internal world models—with a rigorous multilingual benchmark and a striking empirical finding (the L3 reasoning cliff) that generalizes across languages, scales, and even to humans. This reframes spatial reasoning limitations as working-memory constraints rather than architectural deficits, which has broad implications for LLM design, multimodal AI, and cognitive science. Paper 1 tackles a practical but narrower infrastructure problem (model routing in hubs). While useful, its impact is more incremental and domain-specific compared to Paper 2's foundational insights.
Paper 2 likely has higher scientific impact due to broader applicability and timeliness: continual routing over rapidly expanding model hubs is a pressing, general problem affecting many tasks and deployment settings. It contributes a formal problem definition (CMR), a large-scale benchmark (CMRBench with 2,000+ models) that can anchor future work, and an efficient method (CARvE) with strong empirical comparisons. Paper 1 is innovative and rigorous for multi-LLM cooperation, but its impact is narrower (multi-agent reasoning/RL on specific reasoning benchmarks) and may be harder to generalize across modalities and hub-scale ecosystems.
Paper 2 likely has higher impact due to introducing a new problem formalization (Continual Model Routing), a large-scale benchmark (CMRBench with 2,000+ models) that can become a community standard, and a general routing method (CARvE) applicable across many tasks and model-hub settings. Its relevance is high given rapid growth of model repositories and MoE-style systems. Paper 1 is timely and practical (offline RL for code LLM post-training) but is narrower in scope (code generation post-training) and less likely to reshape broader workflows compared with a benchmark+framework that affects model selection infrastructure.
Paper 2 likely has higher scientific impact due to strong timeliness and broad relevance: alignment faking is central to AI safety, governance, and deployment risk. It offers a clearer mechanistic decomposition (values, goal guarding, sycophancy) supported by controlled setups, ablations, and activation steering, making the findings actionable for detection/mitigation and generalizable across model scales. Paper 1 is methodologically solid and useful for model-hub engineering, but its impact is more domain-specific (routing/benchmarking) and less cross-cutting than safety-alignment insights.
Paper 2 has higher likely impact: it addresses a timely, broadly relevant problem (faithful evaluation of VLM explainability) and identifies a fundamental failure mode in current metrics. Its contribution—a theoretically grounded, scalable cross-modal synergy metric (Shapley/Harsanyi-based) with strong empirical validation across models, methods, and datasets—can reshape how multimodal XAI is assessed and audited, with clear implications for safety-critical deployment. Paper 1 is novel and useful for model hubs, but its impact is more niche to routing/benchmarking in evolving expert pools and may depend on adoption of specific hub paradigms.
Paper 1 addresses a highly practical and timely problem, scaling model selection and routing in growing AI model hubs, with a concrete benchmark (CMRBench with 2,000+ models) and a novel method (CARvE). This has broad real-world applicability as model hubs like Hugging Face continue to expand. It formalizes a new problem setting (Continual Model Routing) that could catalyze an entire research direction. Paper 2 provides interesting mechanistic insights about depth utilization in agentic LLMs, but its findings are more observational/analytical and less likely to directly influence system design or spawn new subfields.
Paper 2 likely has higher impact: it formalizes a broadly relevant new problem setting (Continual Model Routing) aligned with the rapid expansion of public model hubs, introduces a large-scale benchmark (CMRBench, >2,000 models) that can standardize evaluation across the community, and proposes a scalable method (CARvE) addressing continual updates—key for real-world deployment. Its applicability spans retrieval, MoE systems, MLOps, and model governance. Paper 1 is novel and useful for LLM agents, but is more specialized to prompt compression and agent action formatting, with narrower cross-field reach.
Paper 1 addresses a broader and more fundamental problem, scaling model selection and routing in growing model hubs, which affects the entire AI ecosystem. It introduces both a formalized problem setting (CMR) and a large-scale benchmark (CMRBench with 2,000+ models), providing lasting infrastructure for the community. Paper 2, while technically sound with its contrastive credit assignment for skill internalization, addresses a narrower problem in agentic RL with incremental improvements (5.5% and 4.4%) over baselines on two benchmarks. Paper 1's benchmark contribution and the growing relevance of model hubs give it higher potential for broad impact.
Paper 2 addresses a more fundamental and timely question about LLM reasoning efficiency—understanding how compressed chain-of-thought data affects post-training. Its systematic taxonomy (Explicit, Composed, Implicit CoT) and findings about SFT vs. RL dynamics have broad implications for the entire LLM training community. The insights about data scaling, memorization risks, and how RL decompresses compressed reasoning steps are novel and actionable. Paper 1, while addressing a practical problem in model routing, targets a narrower audience and a more incremental infrastructure challenge with less fundamental scientific contribution.
Paper 1 addresses a fundamental and highly scalable infrastructure problem in AI—routing across thousands of evolving models in expanding hubs. Its methodological contributions (a large-scale benchmark and a novel continual routing method) have broad applicability across all domains of machine learning. While Paper 2 offers an important contribution to medical AI safety and fairness, Paper 1's generalizable approach to model selection and mixture-of-experts systems will likely drive wider adoption and impact across the broader AI and systems research communities.
Paper 2 introduces a novel problem formalization (Continual Model Routing) addressing a fundamental and growing infrastructure challenge in AI—how to efficiently route among thousands of evolving pre-trained models. This has broad applicability across all AI domains and introduces both a benchmark and a method (CARvE). Paper 1, while well-executed, is primarily an evaluation benchmark for LMMs in K-12 education—a more narrowly scoped contribution. Paper 2's problem will grow in importance as model hubs expand, giving it higher long-term impact potential and broader cross-field relevance.