Continual Model Routing in Evolving Model Hubs

Jack Bell, Giacomo Carfì, Gerlando Gramaglia, Vincenzo Lomonaco

May 27, 2026

arXiv:2605.28577v1

cs.AI (primary), cs.LG

#686 of 2682 · Artificial Intelligence

Tournament Score: 1461 ± 48

Win Rate: 69%

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 6.5 · Clarity 7

Abstract

AI model hubs provide access to a rapidly growing collection of powerful pre-trained models, enabling off-the-shelf mixture-of-experts systems with different routing strategies. However, this rapid growth poses two fundamental challenges: scaling model selection across thousands of experts and continually updating routing mechanisms as new models and tasks are introduced. In this paper, we formalise this setting as Continual Model Routing (CMR) and propose CMRBench, a new large-scale benchmark simulating realistic hub expansion and including over 2,000 candidate models. Finally, we introduce CARvE, a contrastive embedding approach for efficient continual model routing via checkpoint-based anchoring and structured replay. Extensive empirical results and ablations show that CARvE significantly outperforms zero-shot retrieval, fine-tuning, and adapter-merging baselines in model, family, and domain-level accuracy.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Continual Model Routing in Evolving Model Hubs

1. Core Contribution

This paper makes three intertwined contributions: (1) the formalization of Continual Model Routing (CMR) as a class-incremental learning problem where the label space (model IDs) expands over time; (2) CMRBench, a four-experience benchmark spanning 2,000+ candidate models built from APIBench, ToolMMBench, and a new HuggingBench dataset; and (3) CARvE, a contrastive embedding-based router that uses checkpoint-based anchoring and structured replay to continually adapt to an expanding model registry.

The key insight is that model routing in growing hubs should not be treated as a static retrieval or one-time classification problem, but rather as a continual learning challenge where new models arrive, the label space shifts, and the router must adapt without catastrophic forgetting. This reframing is well-motivated: model hubs like Hugging Face host millions of models, and any practical routing system must handle non-stationarity.

2. Methodological Rigor

The experimental design is thorough. The authors compare CARvE against a comprehensive set of baselines: retrieval-only (BM25, SentenceTransformers, SPLADE, BGE-M3), sequential fine-tuning, random replay at multiple budgets, model merging (TIES, DARE, averaging), regularization methods (EWC, LwF), joint training upper bounds, cumulative training, and HuggingGPT-style controllers. Three random seeds are used throughout, with standard errors reported.

The ablation study is well-structured, isolating contributions of: replay strategy (random vs. domain-stratified), anchoring mechanisms (embedding vs. projection), candidate set size K, backbone sensitivity (LLaMA2-7B, Qwen2.5-7B, Qwen3-4B), and label noise robustness. The finding that projection anchoring is more critical than embedding anchoring provides actionable architectural insight.
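The difference between random and domain-stratified replay can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the buffer layout, the `domain` field, and the round-robin selection rule are all hypothetical, and the paper's actual structured replay may differ.

```python
import random
from collections import defaultdict


def random_replay(buffer, budget, rng):
    """Uniformly sample up to `budget` past examples, ignoring domain."""
    return rng.sample(buffer, min(budget, len(buffer)))


def domain_stratified_replay(buffer, budget, rng):
    """Sample past examples round-robin across domains (hypothetical rule)."""
    by_domain = defaultdict(list)
    for example in buffer:
        by_domain[example["domain"]].append(example)
    domains = sorted(by_domain)
    for d in domains:
        rng.shuffle(by_domain[d])
    selected = []
    # Take one example per domain in turn so small domains are not crowded out.
    while len(selected) < budget and any(by_domain[d] for d in domains):
        for d in domains:
            if by_domain[d] and len(selected) < budget:
                selected.append(by_domain[d].pop())
    return selected


# Toy buffer with made-up data: 90 "vision" examples and 10 "audio" examples.
buffer = [{"prompt": f"p{i}", "model_id": f"v{i}", "domain": "vision"} for i in range(90)]
buffer += [{"prompt": f"q{i}", "model_id": f"a{i}", "domain": "audio"} for i in range(10)]

rng = random.Random(0)
stratified = domain_stratified_replay(buffer, budget=10, rng=rng)
counts = defaultdict(int)
for ex in stratified:
    counts[ex["domain"]] += 1
print(dict(counts))  # {'audio': 5, 'vision': 5}
```

With uniform sampling, the same budget would on average draw about nine vision examples for every audio example, so a minority domain can be nearly absent from the replay set.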

However, several methodological concerns warrant discussion:

Synthetic query generation: HuggingBench queries are LLM-generated via self-instruct, which may not capture the diversity and ambiguity of real user queries. The human evaluation (N=100, two annotators) shows reasonable but not exceptional quality (prompt naturalness mean = 3.61/5).

Ground-truth validity: The one-to-one instruction-model mapping assumption is restrictive. In practice, multiple models could validly serve a query, making exact model-ID accuracy somewhat artificial. The paper partially addresses this with family and domain accuracy metrics.

Scale claims vs. reality: The paper claims relevance to "thousands" of models, and the benchmark does contain ~2,000 models with ~34,000 total samples. Real hubs, however, host millions of models, and it remains unclear how CARvE would scale to that size.
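The family and domain accuracy metrics mentioned under ground-truth validity can be illustrated as progressively coarser versions of exact model-ID accuracy. The sketch below assumes a registry that maps each model ID to one family and one domain. The registry contents, model names, and the function `routing_accuracies` are hypothetical, and the paper's exact metric definitions may differ.

```python
def routing_accuracies(predictions, targets, registry):
    """Compute exact, family-level, and domain-level routing accuracy."""
    n = len(targets)
    model_hits = family_hits = domain_hits = 0
    for pred, gold in zip(predictions, targets):
        model_hits += pred == gold
        family_hits += registry[pred]["family"] == registry[gold]["family"]
        domain_hits += registry[pred]["domain"] == registry[gold]["domain"]
    return {
        "model": model_hits / n,
        "family": family_hits / n,
        "domain": domain_hits / n,
    }


# Hypothetical registry: model ID -> family and domain.
registry = {
    "alpha-small": {"family": "alpha", "domain": "text"},
    "alpha-large": {"family": "alpha", "domain": "text"},
    "beta-small": {"family": "beta", "domain": "text"},
    "gamma-base": {"family": "gamma", "domain": "vision"},
}
targets = ["alpha-small", "alpha-small", "beta-small", "gamma-base"]
predictions = ["alpha-small", "alpha-large", "alpha-small", "beta-small"]

scores = routing_accuracies(predictions, targets, registry)
print(scores)  # {'model': 0.25, 'family': 0.5, 'domain': 0.75}
```

In this toy case, a prediction that picks a sibling model from the right family counts as a miss at the model level but a hit at the family and domain levels, which is why the coarser metrics soften the one-to-one mapping assumption.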

3. Potential Impact

Practical relevance: As model hubs continue to grow, automated routing becomes increasingly important. CARvE's embedding-based approach with O(Kd) per-example scoring and potential FAISS integration makes it deployable in latency-sensitive settings. The compute analysis showing 45-48% reduction vs. cumulative/from-scratch retraining strengthens the practical case.
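The O(Kd) cost comes from scoring one query embedding of dimension d against K candidate model embeddings. A minimal brute-force sketch follows, assuming cosine similarity as the scoring function. The embeddings, model names, and helper functions are hypothetical, and the real router's scoring may differ.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def score_candidates(query, candidates):
    """Cosine similarity against K candidates: K dot products of length d."""
    q = normalize(query)
    return {
        model_id: sum(a * b for a, b in zip(q, normalize(emb)))
        for model_id, emb in candidates.items()
    }


def route(query, candidates, top_n=1):
    """Return the `top_n` model IDs with the highest similarity to the query."""
    scores = score_candidates(query, candidates)
    return heapq.nlargest(top_n, scores, key=scores.get)


# Toy 3-dimensional embeddings (made-up values).
candidates = {
    "model-a": [1.0, 0.0, 0.0],
    "model-b": [0.0, 1.0, 0.0],
    "model-c": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]
print(route(query, candidates, top_n=2))  # ['model-a', 'model-c']
```

For a large registry, the exhaustive loop could be replaced by an approximate nearest-neighbour index such as FAISS, which is the integration the assessment alludes to.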

Community infrastructure: CMRBench fills a genuine gap—prior routing benchmarks assumed static candidate pools. The temporal structuring across four experiences with realistic model overlap provides a reproducible evaluation substrate for future work.

Broader influence: The paper connects model routing to continual learning in a principled way, which could influence how the community thinks about maintaining AI infrastructure systems more generally. The connection to MoE architectures and system-level orchestration is well-drawn.

Limitations on impact: The approach requires supervised prompt-model pairs for each new model, creating a cold-start problem that the authors acknowledge but do not solve. This significantly limits deployment in truly open-ended hub settings where new models arrive continuously without curated routing examples.

4. Timeliness & Relevance

This work addresses a genuine and growing bottleneck. The proliferation of specialized models and the shift toward "scaling by specialization" make routing a first-class problem. The timing is appropriate—model hubs are at a scale where manual selection is infeasible, but automated routing infrastructure is underdeveloped.

The continual learning framing is particularly timely given recent community attention to maintaining foundation model systems over time rather than retraining from scratch.

5. Strengths & Limitations

Key Strengths:

Clean problem formalization bridging continual learning and model routing

Comprehensive experimental coverage with 15+ baselines and extensive ablations

The model family accuracy metric is a sensible intermediate granularity measure

Strong empirical results: CARvE at 10% replay achieves 80.7% D-Acc with 5.9% D-Fgt vs. 75.9% and 13.1% for standard replay

Compute efficiency analysis with concrete cost projections

Top-3 domain accuracy of 94.8% suggests practical viability with lightweight re-ranking

Notable Weaknesses:

Cold-start limitation: Cannot route to models without supervised examples, fundamentally limiting applicability in fast-evolving hubs

Benchmark construction concerns: Self-instruct-generated queries and potentially circular evaluation (the router is trained and tested on synthetic data from the same pipeline)

Limited real-world validation: No deployment study or evaluation with actual user queries

Model-ID accuracy remains low: Even the best CARvE configuration achieves only ~46% model-ID accuracy, and the gap to upper bounds (50.4% from-scratch) suggests fundamental limitations

Inconsistency in reported numbers: Table 2 reports CARvE family accuracy at 51.9% for 20% replay, but the discussion section mentions "51.9% vs. 59.8%"—the 59.8% figure doesn't appear in Table 2, suggesting a reporting error

LoRA-only backbone adaptation: The frozen backbone with LoRA adapters may become a bottleneck for truly out-of-distribution domains, as acknowledged by the authors

Overall Assessment

This paper makes a meaningful contribution by formalizing continual model routing, providing a benchmark, and demonstrating that continual learning techniques significantly improve routing stability. The experimental work is extensive and largely convincing. However, the cold-start limitation, reliance on synthetic data, and relatively modest absolute accuracy levels temper the practical significance. The work is best understood as establishing a research direction and evaluation framework rather than providing a deployment-ready solution.

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 6.5 · Clarity 7

Generated May 28, 2026

Comparison History (16)

vs. Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact due to its timely, security-critical framing (persistent, stateful “Sleeper Attacks” on LLM agents), broad relevance across agent frameworks, safety, and deployment contexts, and clear real-world implications for tool-using systems. It introduces a novel threat model beyond single-turn jailbreaks and provides a sizable benchmark spanning outcomes, strategies, and state targets, with evidence across multiple open/closed models. Paper 1 is valuable for scalable model routing and benchmarking, but its impact is narrower and more systems/ML-infra focused.

vs. REED: Post-Training Representation Editing for Cross-Domain Linguistic Steganalysis

gemini-3.1 · 5/28/2026

Paper 1 addresses a highly timely and broad challenge in modern AI: efficiently routing across rapidly expanding hubs of pre-trained models. By formalizing Continual Model Routing and introducing a large-scale benchmark (CMRBench), it provides foundational tools for the growing ecosystem of LLMs and MoE systems. In contrast, Paper 2 tackles linguistic steganalysis, which, while valuable for cybersecurity, represents a more niche application. Paper 1's broader relevance across various AI domains and its alignment with current trends in model scaling give it a significantly higher potential for widespread scientific impact.

vs. A Query Engine for the Agents

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact: it formalizes a broadly relevant problem (continual routing over expanding model hubs), contributes a large-scale benchmark (CMRBench with 2,000+ models) that can anchor community progress, and proposes a general routing method (CARvE) with empirical comparisons/ablations. Its applicability spans many tasks and model ecosystems, aligning with a timely shift toward model hubs and MoE-style selection. Paper 1 is highly practical and innovative for JS-native analytics with LLM UDFs, but its impact is more domain-specific (client-side data/agent tooling) and benchmark generality may be narrower.

vs. Generative AI and the Productivity Divide: Human-AI Complementarities in Education

gpt-5.2 · 5/28/2026

Paper 1 has higher likely scientific impact due to strong methodological novelty (formalizing Continual Model Routing, introducing a large-scale benchmark with >2,000 models, and proposing a new contrastive routing method) and broad applicability across ML systems that increasingly rely on model hubs and mixture-of-experts. Its contributions are reusable infrastructure (CMRBench) plus an algorithm (CARvE), enabling follow-on work and adoption in many domains. Paper 2 is timely and practically relevant, but its impact may be narrower to education/management contexts and depends on external validity beyond the experimental setting.

vs. Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a concrete, timely problem (scaling model selection in growing model hubs) with a well-defined benchmark (CMRBench with 2,000+ models) and a novel method (CARvE) backed by extensive empirical validation. It has immediate practical applicability as model hubs like HuggingFace continue to grow explosively. Paper 1, while intellectually interesting in formalizing managed autonomy for agentic AI, is primarily theoretical with limited empirical validation. Paper 2's benchmark contribution alone provides lasting infrastructure for the community, and its combination of formalization, benchmark, and method gives it broader and more immediate impact.

vs. Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a fundamental question about LLM cognition—whether they build internal world models—with a rigorous multilingual benchmark and a striking empirical finding (the L3 reasoning cliff) that generalizes across languages, scales, and even to humans. This reframes spatial reasoning limitations as working-memory constraints rather than architectural deficits, which has broad implications for LLM design, multimodal AI, and cognitive science. Paper 1 tackles a practical but narrower infrastructure problem (model routing in hubs). While useful, its impact is more incremental and domain-specific compared to Paper 2's foundational insights.

vs. TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact due to broader applicability and timeliness: continual routing over rapidly expanding model hubs is a pressing, general problem affecting many tasks and deployment settings. It contributes a formal problem definition (CMR), a large-scale benchmark (CMRBench with 2,000+ models) that can anchor future work, and an efficient method (CARvE) with strong empirical comparisons. Paper 1 is innovative and rigorous for multi-LLM cooperation, but its impact is narrower (multi-agent reasoning/RL on specific reasoning benchmarks) and may be harder to generalize across modalities and hub-scale ecosystems.

vs. Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact due to introducing a new problem formalization (Continual Model Routing), a large-scale benchmark (CMRBench with 2,000+ models) that can become a community standard, and a general routing method (CARvE) applicable across many tasks and model-hub settings. Its relevance is high given rapid growth of model repositories and MoE-style systems. Paper 1 is timely and practical (offline RL for code LLM post-training) but is narrower in scope (code generation post-training) and less likely to reshape broader workflows compared with a benchmark+framework that affects model selection infrastructure.

vs. Behavioural Analysis of Alignment Faking

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact due to strong timeliness and broad relevance: alignment faking is central to AI safety, governance, and deployment risk. It offers a clearer mechanistic decomposition (values, goal guarding, sycophancy) supported by controlled setups, ablations, and activation steering, making the findings actionable for detection/mitigation and generalizable across model scales. Paper 1 is methodologically solid and useful for model-hub engineering, but its impact is more domain-specific (routing/benchmarking) and less cross-cutting than safety-alignment insights.

vs. Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

gpt-5.2 · 5/28/2026

Paper 2 has higher likely impact: it addresses a timely, broadly relevant problem (faithful evaluation of VLM explainability) and identifies a fundamental failure mode in current metrics. Its contribution—a theoretically grounded, scalable cross-modal synergy metric (Shapley/Harsanyi-based) with strong empirical validation across models, methods, and datasets—can reshape how multimodal XAI is assessed and audited, with clear implications for safety-critical deployment. Paper 1 is novel and useful for model hubs, but its impact is more niche to routing/benchmarking in evolving expert pools and may depend on adoption of specific hub paradigms.

vs. Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

claude-opus-4.6 · 5/28/2026

Paper 1 addresses a highly practical and timely problem—scaling model selection and routing in growing AI model hubs—with a concrete benchmark (CMRBench with 2,000+ models) and a novel method (CARvE). This has broad real-world applicability as model hubs like HuggingFace continue to expand. It formalizes a new problem setting (Continual Model Routing) that could catalyze an entire research direction. Paper 2 provides interesting mechanistic insights about depth utilization in agentic LLMs, but its findings are more observational/analytical and less likely to directly influence system design or spawn new subfields.

vs. AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact: it formalizes a broadly relevant new problem setting (Continual Model Routing) aligned with the rapid expansion of public model hubs, introduces a large-scale benchmark (CMRBench, >2,000 models) that can standardize evaluation across the community, and proposes a scalable method (CARvE) addressing continual updates—key for real-world deployment. Its applicability spans retrieval, MoE systems, MLOps, and model governance. Paper 1 is novel and useful for LLM agents, but is more specialized to prompt compression and agent action formatting, with narrower cross-field reach.

vs. SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

claude-opus-4.6 · 5/28/2026

Paper 1 addresses a broader and more fundamental problem—scaling model selection and routing in growing model hubs—which affects the entire AI ecosystem. It introduces both a formalized problem setting (CMR) and a large-scale benchmark (CMRBench with 2000+ models), providing lasting infrastructure for the community. Paper 2, while technically sound with its contrastive credit assignment for skill internalization, addresses a narrower problem in agentic RL with incremental improvements (5.5% and 4.4%) over baselines on two benchmarks. Paper 1's benchmark contribution and the growing relevance of model hubs give it higher potential for broad impact.

vs. Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a more fundamental and timely question about LLM reasoning efficiency—understanding how compressed chain-of-thought data affects post-training. Its systematic taxonomy (Explicit, Composed, Implicit CoT) and findings about SFT vs. RL dynamics have broad implications for the entire LLM training community. The insights about data scaling, memorization risks, and how RL decompresses compressed reasoning steps are novel and actionable. Paper 1, while addressing a practical problem in model routing, targets a narrower audience and a more incremental infrastructure challenge with less fundamental scientific contribution.

vs. MIRA: A Bilingual Benchmark for Medical Information Response Audit

gemini-3.1 · 5/28/2026

Paper 1 addresses a fundamental and highly scalable infrastructure problem in AI—routing across thousands of evolving models in expanding hubs. Its methodological contributions (a large-scale benchmark and a novel continual routing method) have broad applicability across all domains of machine learning. While Paper 2 offers an important contribution to medical AI safety and fairness, Paper 1's generalizable approach to model selection and mixture-of-experts systems will likely drive wider adoption and impact across the broader AI and systems research communities.

vs. LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

claude-opus-4.6 · 5/28/2026

Paper 2 introduces a novel problem formalization (Continual Model Routing) addressing a fundamental and growing infrastructure challenge in AI—how to efficiently route among thousands of evolving pre-trained models. This has broad applicability across all AI domains and introduces both a benchmark and a method (CARvE). Paper 1, while well-executed, is primarily an evaluation benchmark for LMMs in K-12 education—a more narrowly scoped contribution. Paper 2's problem will grow in importance as model hubs expand, giving it higher long-term impact potential and broader cross-field relevance.