Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline

Wenxuan Song, Jiayi Chen, Xiaoquan Sun, Huashuo Lei, Yikai Qin, Wei Zhao, Pengxiang Ding, Han Zhao

Feb 26, 2026

arXiv:2602.22663v1

cs.RO (primary)

#515 of 3245 · Robotics

Tournament Score: 1492±32 (scale 1000 to 1800)

61%

Win Rate

Rating: 5.8/10

Significance 6 · Rigor 5.5 · Novelty 5 · Clarity 7

Abstract

Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a new benchmark spanning diverse embodiments in both simulation and the real world with consideration of domain randomization. We collect 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories to support training on CEBench. Second, using CEBench as our testbed, we study three critical aspects of VLAs' practicality and offer several key findings. Informed by these findings, we introduce LLaVA-VLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it integrates a compact VLM backbone with multi-view perception, proprioceptive tokenization, and action chunking. To eliminate reliance on costly pre-training, LLaVA-VLA adopts a two-stage training paradigm including post-training and fine-tuning. Furthermore, LLaVA-VLA extends the action space to unify navigation and manipulation. Experiments across embodiments demonstrate the generalization and versatility of LLaVA-VLA, while real-world mobile manipulation experiments establish it as the first end-to-end VLA model for mobile manipulation. We will open-source all datasets, code, and checkpoints upon acceptance to foster reproducibility and future research.

AI Impact Assessments

(3 models)

Scientific Impact Assessment

1. Core Contribution

This paper addresses three practical limitations of Vision-Language-Action (VLA) models: excessive parameter counts, costly pre-training requirements, and limited cross-embodiment applicability (particularly for mobile manipulation). The contributions are twofold:

CEBench: A cross-embodiment benchmark spanning single-arm (CALVIN), bimanual (RoboTwin), and bimanual mobile manipulation (real-world Cobot-Magic), with 14.4k simulated and 1.6k real-world trajectories across 44 tasks. The benchmark explicitly incorporates domain randomization (DR) settings for evaluating visual generalization.

LLaVA-VLA: A 0.5B-parameter VLA built on LLaVA-OneVision-0.5B that integrates multi-view perception (concatenated first/third-person images), proprioceptive tokenization, action chunking, and a hybrid navigation-manipulation action space using direction + value tokens. It uses a two-stage training paradigm (post-training on multi-task data, then fine-tuning) rather than large-scale pre-training.
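The summary above names proprioceptive tokenization and direction + value tokens without giving the encoding. A minimal sketch of one common approach, uniform binning, is shown here. The bin count, value ranges, token names (`POS`/`NEG`), and all function names are assumptions for illustration, not details taken from the paper.

```python
from typing import Tuple

N_BINS = 256  # hypothetical number of value tokens (assumption)


def to_bin(x: float, low: float, high: float, n_bins: int = N_BINS) -> int:
    """Map a continuous value in [low, high] to an integer bin index."""
    x = min(max(x, low), high)  # clip out-of-range readings
    frac = (x - low) / (high - low)
    return min(int(frac * n_bins), n_bins - 1)


def from_bin(idx: int, low: float, high: float, n_bins: int = N_BINS) -> float:
    """Recover the bin-centre value for a token index."""
    return low + (idx + 0.5) * (high - low) / n_bins


def encode_signed(v: float, max_abs: float) -> Tuple[str, int]:
    """Split a signed command into a direction token and a magnitude token."""
    direction = "POS" if v >= 0 else "NEG"
    return direction, to_bin(abs(v), 0.0, max_abs)


def decode_signed(direction: str, idx: int, max_abs: float) -> float:
    """Invert encode_signed, up to the quantization error of one bin."""
    magnitude = from_bin(idx, 0.0, max_abs)
    return magnitude if direction == "POS" else -magnitude


if __name__ == "__main__":
    # A joint angle in [-1, 1] rad becomes one token index.
    print(to_bin(0.0, -1.0, 1.0))
    # A base velocity of -0.3 m/s becomes a direction token plus a value token.
    print(encode_signed(-0.3, max_abs=1.0))
```

Under this scheme the round-trip error is at most half a bin width, so the bin count trades vocabulary size against control resolution.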

2. Methodological Rigor

The paper's systematic ablation study organized around three research questions (Q1-Q3) is one of its stronger methodological elements. Each design choice—multi-view integration, proprioception encoding, action chunking, training curriculum, and action space design—is ablated individually with results reported on established benchmarks. This provides clear justification for architectural decisions.

However, there are notable methodological concerns:

Statistical reporting is incomplete. No confidence intervals or standard deviations are reported for any experiment, despite stochastic evaluation. The CALVIN evaluation uses 1000 trials (good), but RoboTwin uses only 100 trials, and the real-world evaluation uses as few as 10 trials for mobile manipulation, which is too few to draw reliable conclusions.

Baseline comparisons are somewhat uneven. On RoboTwin, LLaVA-VLA (40.3% seen) underperforms RDT (48.9% seen) significantly, yet the paper emphasizes the DR setting where LLaVA-VLA leads (28.6% vs 11.4%). While DR robustness is valuable, the seen-setting gap is not adequately discussed.

The claim of being "the first end-to-end VLA for mobile manipulation" is evaluated on only 2 tasks with 10 episodes each, with a success rate of 4/10 on each task. This is insufficient evidence for such a strong claim.

The CALVIN comparison is favorable but somewhat selective in baselines. Recent strong methods (e.g., SuSIE, Octo) are absent from comparison.
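To make the concern about trial counts concrete, the sketch below computes a 95% Wilson score interval for a binomial success rate. It is an illustration of the kind of interval the review asks for, not an analysis from the paper; the function name is hypothetical, and the only inputs taken from the text are the 4/10 mobile manipulation result and the 1000-trial CALVIN count.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # 4 successes in 10 episodes, as reported for each mobile manipulation task.
    lo, hi = wilson_interval(4, 10)
    print(f"4/10    -> [{lo:.3f}, {hi:.3f}]")
    # Same 40% rate with 1000 trials, for comparison of interval width only.
    lo, hi = wilson_interval(400, 1000)
    print(f"400/1000 -> [{lo:.3f}, {hi:.3f}]")
```

With 10 episodes, a 40% observed success rate is compatible with a true rate anywhere from roughly 17% to 69%, which is why the 10-trial results support only weak conclusions.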

3. Potential Impact

The paper addresses a genuinely important practical bottleneck—making VLAs deployable on consumer-grade hardware. The finding that a 0.5B model can compete with 7B models (F1) and that large-scale pre-training is not essential (F5) could meaningfully lower barriers to entry for robotics researchers.

The hybrid action space for navigation + manipulation (F8) is a practical contribution, though the current results are preliminary. If refined, this could enable a new class of mobile manipulation VLAs.

The promised open-source release of datasets, code, and checkpoints would amplify impact significantly, particularly CEBench as a standardized evaluation suite for practical VLA research.

However, several factors limit broader impact:

The benchmark, while spanning multiple embodiments, is relatively small (14.4k sim + 1.6k real trajectories) compared to datasets like Open X-Embodiment.

The mobile manipulation capabilities are nascent—4/10 success rates are far from practical deployment.

Many findings (e.g., multi-view helps, action chunking helps, proprioception helps) are individually well-known; the novelty lies more in their integration and systematic validation in the lightweight regime.

4. Timeliness & Relevance

This paper is highly timely. The VLA field is rapidly expanding with models like π0, GR00T N1, and OpenVLA, but practical deployment remains elusive due to computational demands. The push toward lightweight, pretraining-free VLAs directly addresses an emerging community need. The focus on mobile manipulation—bridging navigation and manipulation in a single model—is also aligned with growing interest in mobile embodied AI (Mobile ALOHA, etc.).

The domain randomization evaluation is particularly relevant, as real-world deployment inevitably involves distribution shifts that many VLAs handle poorly.

5. Strengths & Limitations

Strengths:

Systematic design study: The Q1-Q3 framework with clear findings (F1-F8) provides an organized investigation that readers can build upon.

Practical efficiency: Post-training on 8 H100s in ~6 hours and fine-tuning on a single 4090 is genuinely practical.

Strong CALVIN results: 3.68 average length with 0.5B parameters surpasses several 3-7B models, demonstrating the viability of compact VLAs.

Domain randomization evaluation: Explicitly testing visual generalization is important and often overlooked.

Cross-embodiment scope: Spanning single-arm, bimanual, and mobile platforms in one framework.
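For readers unfamiliar with the "average length" figure cited under the CALVIN results, the sketch below shows how that metric is usually computed, assuming the standard CALVIN long-horizon protocol of five-instruction chains. The function name and the sample rollout data are made up for illustration and do not come from the paper.

```python
from typing import List, Tuple


def calvin_metrics(completed: List[int], chain_len: int = 5) -> Tuple[List[float], float]:
    """Return per-step success rates and the average chain length.

    completed[i] is the number of consecutive instructions solved in
    rollout i before the first failure (0..chain_len).
    """
    n = len(completed)
    if n == 0:
        raise ValueError("need at least one rollout")
    # rate[k-1] = fraction of rollouts that solved at least k tasks in a row
    rates = [sum(c >= k for c in completed) / n for k in range(1, chain_len + 1)]
    # The average length equals the sum of those rates (the mean of `completed`).
    return rates, sum(rates)


if __name__ == "__main__":
    rollouts = [5, 3, 0, 4, 2]  # made-up example data
    rates, avg_len = calvin_metrics(rollouts)
    print(rates, avg_len)
```

The maximum possible value is 5, so an average length of 3.68 means a rollout completes between three and four chained instructions on average.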

Limitations:

Shallow mobile manipulation validation: Only 2 tasks, 10 trials each, with 40% success rates. This weakens the "first mobile manipulation VLA" claim significantly.

No error analysis or failure mode discussion: Understanding when and why the model fails would strengthen the contribution.

Missing important baselines: No comparison with SmolVLA, NORA, or MiniVLA—models explicitly designed for lightweight VLA—despite discussing them in related work.

Benchmark novelty is moderate: CEBench largely wraps existing benchmarks (CALVIN, RoboTwin) with a real-world addition, rather than introducing fundamentally new evaluation paradigms.

Limited investigation of scaling laws: The paper shows 0.5B vs 7B but doesn't explore intermediate scales or provide deeper analysis of what drives the small model's competitiveness.

Some findings lack depth: For example, F5 (pre-training not essential) is demonstrated with only two tasks (Table V), and the pre-training baseline details are unclear.

Overall Assessment

This is a solid engineering and empirical contribution that addresses a real need in the VLA community. The systematic study of design choices for lightweight VLAs provides useful guidance, and the CALVIN results are genuinely impressive for a 0.5B model. However, the novelty is primarily in integration rather than individual techniques, the mobile manipulation results are preliminary, and several experimental aspects lack the depth needed for strong conclusions. The paper would benefit from more rigorous statistical analysis, stronger baselines, and deeper mobile manipulation evaluation.

Rating: 5.8/10

Significance 6 · Rigor 5.5 · Novelty 5 · Clarity 7

Generated Apr 19, 2026

Comparison History (46)

vs. RotVLA: Rotational Latent Action for Vision-Language-Action Model

claude-opus-4.6 · 5/16/2026

RotVLA introduces a fundamentally novel representation—rotational latent actions on SO(n)—that provides a principled mathematical framework for cross-embodiment action representation with continuity, compositionality, and geometric structure. This theoretical contribution is more innovative than LLaVA-VLA's engineering-focused improvements. RotVLA also achieves strong empirical results (98.2% LIBERO, strong RoboTwin2.0) with a compact 1.7B model, demonstrating that the representation itself drives performance. The idea of using rotation groups for latent actions is likely to inspire follow-up work across robotics and representation learning, giving it broader potential impact.

vs. CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics

claude-opus-4.6 · 4/1/2026

Paper 2 addresses broader and more practical challenges in VLA models—scalability, deployment on consumer GPUs, cross-embodiment generalization, and mobile manipulation. It contributes a comprehensive benchmark (CEBench) with both simulated and real-world data, introduces a lightweight deployable VLA (LLaVA-VLA), and promises full open-source release. Its breadth of impact across diverse embodiments, practical deployment considerations, and benchmark contribution give it higher potential to influence the field. Paper 1, while technically sound with strong LIBERO results, is narrower in scope, focusing on a specific planning framework for manipulation.

vs. Metamorphic Testing of Vision-Language Action-Enabled Robots

gpt-5.2 · 3/25/2026

Paper 1 likely has higher impact due to broader novelty and utility: it introduces a substantial new cross-embodiment benchmark (sim+real, domain randomization) with sizable datasets and an improved lightweight baseline enabling practical deployment and unified navigation+manipulation, including real-world mobile manipulation results and planned open-sourcing. This can become a community standard for training/evaluating VLAs, influencing robotics, multimodal learning, and benchmarks. Paper 2 is rigorous and timely for reliability, but is narrower (testing methodology) and likely more incremental, with impact mainly in VLA verification rather than core capability advances.

vs. HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness

gpt-5.2 · 3/19/2026

Paper 1 likely has higher scientific impact due to broader and more foundational contributions: a new cross-embodiment benchmark (simulation+real) plus a lightweight end-to-end VLA baseline aimed at democratizing training/deployment on consumer GPUs, with open-source commitments that can catalyze community adoption. Its potential applications span generalist robotics (manipulation and navigation+manipulation) and can influence evaluation norms. Paper 2 is timely and clever but more narrow (inference acceleration for AR VLAs), depends on environment-specific retrieval databases/tuning, and relaxes SD guarantees with measurable success-rate drops, limiting generality.

vs. D-SLAMSpoof: An Environment-Agnostic LiDAR Spoofing Attack using Dynamic Point Cloud Injection

gpt-5.2 · 3/13/2026

Paper 2 is likely to have higher scientific impact because it exposes a broadly relevant, safety-critical vulnerability in LiDAR SLAM in realistic feature-rich settings and pairs it with a deployable defense using standard onboard sensors. The attack is principled (scan-matching–guided) and evaluated across multiple SLAM systems with strong gains in success rate, giving it cross-system relevance. Paper 1 is timely and useful but more incremental (benchmark + lightweight baseline), with weaker real-world evidence and missing key contemporary baselines, limiting field-shaping impact.

vs. SSMG-Nav: Enhancing Lifelong Object Navigation with Semantic Skeleton Memory Graph

gemini-3 · 3/12/2026

Paper 2 introduces a novel, principled memory representation (SSMG) with strong empirical results, notably doubling the Success weighted by Path Length (SPL) over baselines. In contrast, Paper 1 offers a more incremental contribution with notable methodological flaws, including missing key baselines and weak evidence for its mobile manipulation claims. Paper 2's rigorous approach to lifelong navigation provides a more significant and reliable advance for the embodied AI community.

vs. SEA-Nav: Efficient Policy Learning for Safe and Agile Quadruped Navigation in Cluttered Environments

gpt-5.2 · 3/11/2026

Paper 1 has higher likely scientific impact due to broader scope and community leverage: it introduces a cross-embodiment benchmark (sim+real) plus an open-source lightweight VLA baseline that targets a widely active, fast-growing VLA ecosystem. If adopted, CEBench and LLaVA-VLA can standardize evaluation and accelerate many downstream robotics/VLM works across manipulation and mobile manipulation. Paper 2 is strong for quadruped navigation, but is narrower in domain and its headline “minute-level training” and safety claims are supported by limited real-world trials, reducing expected cross-field and long-term impact.

vs. Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments

claude-opus-4.6 · 3/11/2026

Paper 1 (SACA) introduces a conceptually novel framework for extracting dense supervision from failed trajectories via structural decomposition, achieving substantial state-of-the-art improvements (7.5-11.7% SR gains). Its core insight—zero-shot trajectory auditing for credit assignment—generalizes beyond VLN-CE to broader long-horizon decision-making problems. Paper 2 (LLaVA-VLA) makes practical but largely incremental contributions, combining existing techniques (multi-view, action chunking, proprioception tokenization) with weak mobile manipulation evidence (10 trials, 40% success). Paper 1 demonstrates stronger novelty, larger performance margins, and broader methodological applicability.

vs. MO-Playground: Massively Parallelized Multi-Objective Reinforcement Learning for Robotics

claude-opus-4.6 · 3/11/2026

Paper 1 addresses a broader and more impactful problem space (practical VLA deployment across diverse embodiments) with stronger methodological contributions including a comprehensive benchmark, systematic ablation studies, and novel architectural integration. Despite limitations in mobile manipulation evaluation, its open-source commitment, strong CALVIN results, and relevance to the rapidly growing VLA community give it higher potential impact. Paper 2, while timely and practically useful, is primarily an engineering/systems contribution with limited algorithmic novelty, a single baseline comparison, and no statistical analysis.

vs. VLN-Pilot: Large Vision-Language Model as an Autonomous Indoor Drone Operator

claude-opus-4.6 · 2/27/2026

Paper 1 offers substantially more rigorous evaluation across multiple benchmarks (CALVIN, RoboTwin, CEBench), diverse embodiments, and real-world experiments. It provides systematic ablations, a reusable benchmark with 16k trajectories, strong CALVIN results with a 0.5B model, and commits to open-sourcing all artifacts. While incremental, it addresses a genuine practical bottleneck (lightweight VLA deployment). Paper 2 is a preliminary proof-of-concept with a single 3-room environment, no baselines, no standard metrics, 5 repetitions per condition, and claims that far exceed the evidence presented.

vs. Bayesian Preference Elicitation: Human-In-The-Loop Optimization of An Active Prosthesis

gemini-3 · 2/27/2026

Paper 1 addresses a major bottleneck in a rapidly exploding field (Robotics/AI): the extreme computational cost of Vision-Language-Action models. By enabling VLA training on consumer GPUs and demonstrating performance comparable to 7B models with only 0.5B parameters, it democratizes access for many researchers. While Paper 2 solves a concrete clinical problem with elegant methods, its impact is limited by an extremely small sample size (n=1 for the main algorithm) and niche application domain compared to the broad applicability of generalist robot foundation models in Paper 1.

vs. Marinarium: a New Arena to Bring Maritime Robotics Closer to Shore

gpt-5.2 · 2/27/2026

Paper 2 has higher likely scientific impact because it targets a rapidly expanding, high-visibility area (vision-language-action robotics) and contributes an adoptable benchmark plus a lightweight baseline aimed at democratizing VLA deployment on consumer GPUs. If released as promised, the dataset/code can catalyze broad follow-on work across ML and robotics. Despite methodological gaps (limited real-world/mobile evidence, missing contemporary baselines), its potential reach and timeliness exceed Paper 1’s primarily infrastructural contribution, whose impact is valuable but more localized to labs able to build/use similar facilities.

vs. An Empirical Analysis of Cooperative Perception for Occlusion Risk Mitigation

claude-opus-4.6 · 2/27/2026

Paper 2 addresses the highly active and rapidly growing VLA/robotics foundation model field with a benchmark and lightweight baseline that democratizes access to VLA training on consumer GPUs. Despite methodological concerns, its open-source commitment, systematic ablations, and timeliness within a crowded research front give it broader community impact. Paper 1 makes solid but narrower contributions to V2X deployment strategy. Paper 2's breadth across embodiments, its benchmark potential for community adoption, and relevance to the booming foundation-model-for-robotics wave give it higher estimated impact.

vs. TOLEBI: Learning Fault-Tolerant Bipedal Locomotion via Online Status Estimation and Fallibility Rewards

gemini-3 · 2/27/2026

While Paper 1 addresses an important niche (fault tolerance in humanoids), Paper 2 targets the extremely hot and broad topic of Vision-Language-Action (VLA) models. Paper 2's focus on democratizing VLAs (running on consumer GPUs) and eliminating massive pre-training addresses a major bottleneck in the field right now. Its systematic benchmarking across diverse embodiments and open-sourcing of code/data for a lightweight, effective model is likely to see wider adoption and citation by the broader robotics and AI community compared to the more specialized application of Paper 1.

vs. EgoAVFlow: Robot Policy Learning with Active Vision from Human Egocentric Videos via 3D Flow

claude-opus-4.6 · 2/27/2026

EgoAVFlow addresses a genuinely novel and underexplored problem—decoupling active vision from human viewpoint imitation for robot learning from egocentric videos—with a principled and elegant solution using 3D flow as a unified representation. Its conceptual contribution (visibility-aware reward with test-time denoising refinement) is more original and could influence multiple fields. Paper 2, while practically useful, makes largely incremental contributions by combining existing techniques (action chunking, multi-view, proprioception tokenization) into a lightweight VLA, with weak mobile manipulation evidence and missing key baselines. Paper 1's problem identification and technical novelty are stronger drivers of scientific impact.

vs. Considering Perspectives for Automated Driving Ethics: Collective Risk in Vehicular Motion Planning

claude-opus-4.6 · 2/27/2026

Paper 1 addresses a rapidly growing field (VLA models) with broad community interest, offers open-source artifacts (benchmark, code, checkpoints) that enable direct adoption, and targets a practical bottleneck (consumer-grade GPU deployment) affecting many researchers. Despite methodological limitations, its timeliness in a hot area, cross-embodiment benchmark, and engineering contributions will likely attract more citations and downstream usage. Paper 2 makes a conceptually elegant but narrower contribution to AV ethics with significant simplifying assumptions (non-interactive agents, uniform uncertainty scaling) that limit near-term practical adoption.

vs. Robust Helicopter Ship Deck Landing With Guaranteed Timing Using Shrinking-Horizon Model Predictive Control

gemini-3 · 2/27/2026

Paper 1 addresses a rapidly expanding and highly cited field (Vision-Language-Action models/Robotics). It democratizes research by enabling consumer-grade GPU training, introduces a benchmark (CEBench), and releases open-source code/models, which typically drives high citation counts and community adoption. While Paper 2 is methodologically rigorous and provides solid engineering for a specific control problem, its impact is narrower, limited to the specialized niche of autonomous helicopter landing, and relies on simulation-only validation without the broad applicability or open-source ecosystem of Paper 1.

vs. Automated Robotic Needle Puncture for Percutaneous Dilatational Tracheostomy

gpt-5.2 · 2/27/2026

Paper 2 has higher likely scientific impact due to broader, faster-moving relevance and community reuse: it introduces a cross-embodiment benchmark plus an open-source lightweight VLA baseline targeting a key bottleneck (compute/pretraining), which can influence many labs and downstream applications across robotics and ML. Although empirical support for mobile manipulation is thin and some baselines are missing, benchmarks/datasets and practical training recipes often become widely adopted. Paper 1 is methodologically rigorous and clinically meaningful but is narrow in scope and early-stage (mannequin-only), limiting near-term uptake and cross-field impact.

vs. HiCrowd: Hierarchical Crowd Flow Alignment for Dense Human Environments

claude-opus-4.6 · 2/27/2026

Paper 2 addresses a broader and more timely problem—making VLA models practical for consumer-grade deployment across diverse embodiments—with higher potential to influence the rapidly growing VLA community. Its benchmark (CEBench), systematic ablations, open-source commitment, and strong CALVIN results with a 0.5B model provide actionable insights for a wide audience. While both papers have methodological limitations, Paper 2's scope spans multiple embodiments and tasks, bridges manipulation and navigation, and targets a critical scalability bottleneck. Paper 1, though well-designed, addresses a narrower problem (crowd navigation) with more incremental contributions.

vs. When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

gemini-3 · 2/27/2026

Paper 1 (UPS) introduces a fundamentally novel theoretical framework for safe robot deployment by combining conformal prediction with VLM reasoning. Its methodological rigor, providing statistical guarantees for when a robot should ask for help versus act, addresses a critical bottleneck in trust and reliability for embodied AI. The Bayesian intent factorization is a sophisticated contribution to VLM calibration. In contrast, Paper 2 is largely an engineering consolidation—combining existing techniques (action chunking, multi-view perception) to create a lightweight model. While practical, it lacks the theoretical depth and novel algorithmic contribution of Paper 1.