Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline
Wenxuan Song, Jiayi Chen, Xiaoquan Sun, Huashuo Lei, Yikai Qin, Wei Zhao, Pengxiang Ding, Han Zhao
Abstract
Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a new benchmark spanning diverse embodiments in both simulation and the real world with consideration of domain randomization. We collect 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories to support training on CEBench. Second, using CEBench as our testbed, we study three critical aspects of VLAs' practicality and offer several key findings. Informed by these findings, we introduce LLaVA-VLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it integrates a compact VLM backbone with multi-view perception, proprioceptive tokenization, and action chunking. To eliminate reliance on costly pre-training, LLaVA-VLA adopts a two-stage training paradigm including post-training and fine-tuning. Furthermore, LLaVA-VLA extends the action space to unify navigation and manipulation. Experiments across embodiments demonstrate the generalization and versatility of LLaVA-VLA, while real-world mobile manipulation experiments establish it as the first end-to-end VLA model for mobile manipulation. We will open-source all datasets, code, and checkpoints upon acceptance to foster reproducibility and future research.
AI Impact Assessments (3 models)
Scientific Impact Assessment
1. Core Contribution
This paper addresses three practical limitations of Vision-Language-Action (VLA) models: excessive parameter counts, costly pre-training requirements, and limited cross-embodiment applicability (particularly for mobile manipulation). The contributions are twofold:
CEBench: A cross-embodiment benchmark spanning single-arm (CALVIN), bimanual (RoboTwin), and bimanual mobile manipulation (real-world Cobot-Magic), with 14.4k simulated and 1.6k real-world trajectories across 44 tasks. The benchmark explicitly incorporates domain randomization (DR) settings for evaluating visual generalization.
LLaVA-VLA: A 0.5B-parameter VLA built on LLaVA-OneVision-0.5B that integrates multi-view perception (concatenated first/third-person images), proprioceptive tokenization, action chunking, and a hybrid navigation-manipulation action space using direction + value tokens. It uses a two-stage training paradigm (post-training on multi-task data, then fine-tuning) rather than large-scale pre-training.
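To make the tokenization ideas in the LLaVA-VLA description more concrete, here is a minimal sketch of how continuous robot values might be turned into discrete tokens, how an action chunk might be flattened into one token sequence, and how a base motion might be written as a direction token plus a value token. This is not the authors' implementation. The bin count (`N_BINS = 256`), the uniform binning scheme, the `DIRECTIONS` vocabulary, and all function names are assumptions made for illustration only; the page does not specify any of them.

```python
from typing import List, Sequence

N_BINS = 256  # assumed bin count; the page does not state one


def to_token(x: float, lo: float, hi: float, n_bins: int = N_BINS) -> int:
    """Map a continuous value in [lo, hi] to an integer bin index (uniform bins)."""
    x = min(max(x, lo), hi)          # clip to the supported range
    frac = (x - lo) / (hi - lo)      # normalise to [0, 1]
    return min(int(frac * n_bins), n_bins - 1)


def from_token(t: int, lo: float, hi: float, n_bins: int = N_BINS) -> float:
    """Decode a bin index back to the centre of its bin."""
    return lo + (t + 0.5) * (hi - lo) / n_bins


def tokenize_chunk(actions: Sequence[Sequence[float]], lo: float, hi: float) -> List[int]:
    """Flatten a chunk of K actions (each D-dimensional) into K*D tokens."""
    return [to_token(v, lo, hi) for action in actions for v in action]


# Hypothetical direction vocabulary for the mobile base.
DIRECTIONS = ["forward", "backward", "left", "right"]


def encode_base_motion(direction: str, magnitude: float, max_mag: float = 1.0) -> List[int]:
    """Encode a base motion as [direction token, value token]."""
    return [DIRECTIONS.index(direction), to_token(magnitude, 0.0, max_mag)]
```

The same `to_token` helper could serve for proprioceptive state (joint angles, gripper width) and for action dimensions, which is one plausible reading of "proprioceptive tokenization"; whether the model shares bins across the two is not stated on this page.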
2. Methodological Rigor
The paper's systematic ablation study organized around three research questions (Q1-Q3) is one of its stronger methodological elements. Each design choice—multi-view integration, proprioception encoding, action chunking, training curriculum, and action space design—is ablated individually with results reported on established benchmarks. This provides clear justification for architectural decisions.
However, there are notable methodological concerns:
3. Potential Impact
The paper addresses a genuinely important practical bottleneck—making VLAs deployable on consumer-grade hardware. The finding that a 0.5B model can compete with 7B models (F1) and that large-scale pre-training is not essential (F5) could meaningfully lower barriers to entry for robotics researchers.
The hybrid action space for navigation + manipulation (F8) is a practical contribution, though the current results are preliminary. If refined, this could enable a new class of mobile manipulation VLAs.
The promised open-source release of datasets, code, and checkpoints would amplify impact significantly, particularly CEBench as a standardized evaluation suite for practical VLA research.
However, several factors limit broader impact:
4. Timeliness & Relevance
This paper is highly timely. The VLA field is rapidly expanding with models like π0, GR00T N1, and OpenVLA, but practical deployment remains elusive due to computational demands. The push toward lightweight, pretraining-free VLAs directly addresses an emerging community need. The focus on mobile manipulation—bridging navigation and manipulation in a single model—is also aligned with growing interest in mobile embodied AI (Mobile ALOHA, etc.).
The domain randomization evaluation is particularly relevant, as real-world deployment inevitably involves distribution shifts that many VLAs handle poorly.
5. Strengths & Limitations
Strengths:
Limitations:
Overall Assessment
This is a solid engineering and empirical contribution that addresses a real need in the VLA community. The systematic study of design choices for lightweight VLAs provides useful guidance, and the CALVIN results are genuinely impressive for a 0.5B model. However, the novelty is primarily in integration rather than individual techniques, the mobile manipulation results are preliminary, and several experimental aspects lack the depth needed for strong conclusions. The paper would benefit from more rigorous statistical analysis, stronger baselines, and deeper mobile manipulation evaluation.
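As an illustration of why small trial counts limit the conclusions that can be drawn, the sketch below computes a Wilson score interval for a success rate. The example numbers (4 successes in 10 trials) correspond to the "10 trials, 40% success" figure quoted in one of the comparison entries on this page. The choice of the Wilson interval and the function name are assumptions for illustration; the assessment itself does not prescribe a particular statistical method.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# 4 successes out of 10 trials, i.e. a 40% observed success rate.
low, high = wilson_interval(4, 10)
print(f"95% interval: [{low:.2f}, {high:.2f}]")
```

With only 10 trials the interval spans roughly 0.17 to 0.69, so the observed 40% is compatible with both a weak and a fairly capable policy.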
Generated Apr 19, 2026
Comparison History (46)
RotVLA introduces a fundamentally novel representation—rotational latent actions on SO(n)—that provides a principled mathematical framework for cross-embodiment action representation with continuity, compositionality, and geometric structure. This theoretical contribution is more innovative than LLaVA-VLA's engineering-focused improvements. RotVLA also achieves strong empirical results (98.2% LIBERO, strong RoboTwin2.0) with a compact 1.7B model, demonstrating that the representation itself drives performance. The idea of using rotation groups for latent actions is likely to inspire follow-up work across robotics and representation learning, giving it broader potential impact.
Paper 2 addresses broader and more practical challenges in VLA models—scalability, deployment on consumer GPUs, cross-embodiment generalization, and mobile manipulation. It contributes a comprehensive benchmark (CEBench) with both simulated and real-world data, introduces a lightweight deployable VLA (LLaVA-VLA), and promises full open-source release. Its breadth of impact across diverse embodiments, practical deployment considerations, and benchmark contribution give it higher potential to influence the field. Paper 1, while technically sound with strong LIBERO results, is narrower in scope, focusing on a specific planning framework for manipulation.
Paper 1 likely has higher impact due to broader novelty and utility: it introduces a substantial new cross-embodiment benchmark (sim+real, domain randomization) with sizable datasets and an improved lightweight baseline enabling practical deployment and unified navigation+manipulation, including real-world mobile manipulation results and planned open-sourcing. This can become a community standard for training/evaluating VLAs, influencing robotics, multimodal learning, and benchmarks. Paper 2 is rigorous and timely for reliability, but is narrower (testing methodology) and likely more incremental, with impact mainly in VLA verification rather than core capability advances.
Paper 1 likely has higher scientific impact due to broader and more foundational contributions: a new cross-embodiment benchmark (simulation+real) plus a lightweight end-to-end VLA baseline aimed at democratizing training/deployment on consumer GPUs, with open-source commitments that can catalyze community adoption. Its potential applications span generalist robotics (manipulation and navigation+manipulation) and can influence evaluation norms. Paper 2 is timely and clever but more narrow (inference acceleration for AR VLAs), depends on environment-specific retrieval databases/tuning, and relaxes SD guarantees with measurable success-rate drops, limiting generality.
Paper 2 is likely to have higher scientific impact because it exposes a broadly relevant, safety-critical vulnerability in LiDAR SLAM in realistic feature-rich settings and pairs it with a deployable defense using standard onboard sensors. The attack is principled (scan-matching–guided) and evaluated across multiple SLAM systems with strong gains in success rate, giving it cross-system relevance. Paper 1 is timely and useful but more incremental (benchmark + lightweight baseline), with weaker real-world evidence and missing key contemporary baselines, limiting field-shaping impact.
Paper 2 introduces a novel, principled memory representation (SSMG) with strong empirical results, notably doubling the Success weighted by Path Length (SPL) over baselines. In contrast, Paper 1 offers a more incremental contribution with notable methodological flaws, including missing key baselines and weak evidence for its mobile manipulation claims. Paper 2's rigorous approach to lifelong navigation provides a more significant and reliable advance for the embodied AI community.
Paper 1 has higher likely scientific impact due to broader scope and community leverage: it introduces a cross-embodiment benchmark (sim+real) plus an open-source lightweight VLA baseline that targets a widely active, fast-growing VLA ecosystem. If adopted, CEBench and LLaVA-VLA can standardize evaluation and accelerate many downstream robotics/VLM works across manipulation and mobile manipulation. Paper 2 is strong for quadruped navigation, but is narrower in domain and its headline “minute-level training” and safety claims are supported by limited real-world trials, reducing expected cross-field and long-term impact.
Paper 1 (SACA) introduces a conceptually novel framework for extracting dense supervision from failed trajectories via structural decomposition, achieving substantial state-of-the-art improvements (7.5-11.7% SR gains). Its core insight—zero-shot trajectory auditing for credit assignment—generalizes beyond VLN-CE to broader long-horizon decision-making problems. Paper 2 (LLaVA-VLA) makes practical but largely incremental contributions, combining existing techniques (multi-view, action chunking, proprioception tokenization) with weak mobile manipulation evidence (10 trials, 40% success). Paper 1 demonstrates stronger novelty, larger performance margins, and broader methodological applicability.
Paper 1 addresses a broader and more impactful problem space (practical VLA deployment across diverse embodiments) with stronger methodological contributions including a comprehensive benchmark, systematic ablation studies, and novel architectural integration. Despite limitations in mobile manipulation evaluation, its open-source commitment, strong CALVIN results, and relevance to the rapidly growing VLA community give it higher potential impact. Paper 2, while timely and practically useful, is primarily an engineering/systems contribution with limited algorithmic novelty, a single baseline comparison, and no statistical analysis.
Paper 1 offers substantially more rigorous evaluation across multiple benchmarks (CALVIN, RoboTwin, CEBench), diverse embodiments, and real-world experiments. It provides systematic ablations, a reusable benchmark with 16k trajectories, strong CALVIN results with a 0.5B model, and commits to open-sourcing all artifacts. While incremental, it addresses a genuine practical bottleneck (lightweight VLA deployment). Paper 2 is a preliminary proof-of-concept with a single 3-room environment, no baselines, no standard metrics, 5 repetitions per condition, and claims that far exceed the evidence presented.
Paper 1 addresses a major bottleneck in a rapidly exploding field (Robotics/AI): the extreme computational cost of Vision-Language-Action models. By enabling VLA training on consumer GPUs and demonstrating performance comparable to 7B models with only 0.5B parameters, it democratizes access for many researchers. While Paper 2 solves a concrete clinical problem with elegant methods, its impact is limited by an extremely small sample size (n=1 for the main algorithm) and niche application domain compared to the broad applicability of generalist robot foundation models in Paper 1.
Paper 2 has higher likely scientific impact because it targets a rapidly expanding, high-visibility area (vision-language-action robotics) and contributes an adoptable benchmark plus a lightweight baseline aimed at democratizing VLA deployment on consumer GPUs. If released as promised, the dataset/code can catalyze broad follow-on work across ML and robotics. Despite methodological gaps (limited real-world/mobile evidence, missing contemporary baselines), its potential reach and timeliness exceed Paper 1’s primarily infrastructural contribution, whose impact is valuable but more localized to labs able to build/use similar facilities.
Paper 2 addresses the highly active and rapidly growing VLA/robotics foundation model field with a benchmark and lightweight baseline that democratizes access to VLA training on consumer GPUs. Despite methodological concerns, its open-source commitment, systematic ablations, and timeliness within a crowded research front give it broader community impact. Paper 1 makes solid but narrower contributions to V2X deployment strategy. Paper 2's breadth across embodiments, its benchmark potential for community adoption, and relevance to the booming foundation-model-for-robotics wave give it higher estimated impact.
While Paper 1 addresses an important niche (fault tolerance in humanoids), Paper 2 targets the extremely hot and broad topic of Vision-Language-Action (VLA) models. Paper 2's focus on democratizing VLAs (running on consumer GPUs) and eliminating massive pre-training addresses a major bottleneck in the field right now. Its systematic benchmarking across diverse embodiments and open-sourcing of code/data for a lightweight, effective model is likely to see wider adoption and citation by the broader robotics and AI community compared to the more specialized application of Paper 1.
EgoAVFlow addresses a genuinely novel and underexplored problem—decoupling active vision from human viewpoint imitation for robot learning from egocentric videos—with a principled and elegant solution using 3D flow as a unified representation. Its conceptual contribution (visibility-aware reward with test-time denoising refinement) is more original and could influence multiple fields. Paper 2, while practically useful, makes largely incremental contributions by combining existing techniques (action chunking, multi-view, proprioception tokenization) into a lightweight VLA, with weak mobile manipulation evidence and missing key baselines. Paper 1's problem identification and technical novelty are stronger drivers of scientific impact.
Paper 1 addresses a rapidly growing field (VLA models) with broad community interest, offers open-source artifacts (benchmark, code, checkpoints) that enable direct adoption, and targets a practical bottleneck (consumer-grade GPU deployment) affecting many researchers. Despite methodological limitations, its timeliness in a hot area, cross-embodiment benchmark, and engineering contributions will likely attract more citations and downstream usage. Paper 2 makes a conceptually elegant but narrower contribution to AV ethics with significant simplifying assumptions (non-interactive agents, uniform uncertainty scaling) that limit near-term practical adoption.
Paper 1 addresses a rapidly expanding and highly cited field (Vision-Language-Action models/Robotics). It democratizes research by enabling consumer-grade GPU training, introduces a benchmark (CEBench), and releases open-source code/models, which typically drives high citation counts and community adoption. While Paper 2 is methodologically rigorous and provides solid engineering for a specific control problem, its impact is narrower, limited to the specialized niche of autonomous helicopter landing, and relies on simulation-only validation without the broad applicability or open-source ecosystem of Paper 1.
Paper 2 has higher likely scientific impact due to broader, faster-moving relevance and community reuse: it introduces a cross-embodiment benchmark plus an open-source lightweight VLA baseline targeting a key bottleneck (compute/pretraining), which can influence many labs and downstream applications across robotics and ML. Although empirical support for mobile manipulation is thin and some baselines are missing, benchmarks/datasets and practical training recipes often become widely adopted. Paper 1 is methodologically rigorous and clinically meaningful but is narrow in scope and early-stage (mannequin-only), limiting near-term uptake and cross-field impact.
Paper 2 addresses a broader and more timely problem—making VLA models practical for consumer-grade deployment across diverse embodiments—with higher potential to influence the rapidly growing VLA community. Its benchmark (CEBench), systematic ablations, open-source commitment, and strong CALVIN results with a 0.5B model provide actionable insights for a wide audience. While both papers have methodological limitations, Paper 2's scope spans multiple embodiments and tasks, bridges manipulation and navigation, and targets a critical scalability bottleneck. Paper 1, though well-designed, addresses a narrower problem (crowd navigation) with more incremental contributions.
Paper 1 (UPS) introduces a fundamentally novel theoretical framework for safe robot deployment by combining conformal prediction with VLM reasoning. Its methodological rigor, providing statistical guarantees for when a robot should ask for help versus act, addresses a critical bottleneck in trust and reliability for embodied AI. The Bayesian intent factorization is a sophisticated contribution to VLM calibration. In contrast, Paper 2 is largely an engineering consolidation—combining existing techniques (action chunking, multi-view perception) to create a lightweight model. While practical, it lacks the theoretical depth and novel algorithmic contribution of Paper 1.