Shizhe Chen, Paul Pacaud, Cordelia Schmid
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding, capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.
PointACT addresses a well-recognized limitation of current VLA models: their reliance on 2D visual representations, which constrains spatial reasoning for precise robotic manipulation. The paper proposes a dual-system architecture in which a pretrained VLM backbone (kept frozen in one of the evaluated settings) handles semantic understanding while a dedicated "point-action expert" integrates hierarchical 3D point cloud features directly into action decoding. The key technical novelty is the multi-scale point-action interaction mechanism based on bottleneck window self-attention: action tokens act as information bottlenecks that aggregate local geometric context from spatially windowed point cloud partitions across multiple PTv3 encoder stages. Because attention is computed within windows, the cost scales linearly with the number of windows while the evolving action tokens are still densely conditioned on geometry.
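To make the bottleneck idea concrete, the following is a minimal NumPy sketch of one way such a layer could work. It is not the paper's implementation. The function name `bottleneck_window_attention`, the single attention head, the identity query/key/value projections, the contiguous windowing of already-serialized points, and the mean over per-window action copies are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bottleneck_window_attention(points, actions, window_size):
    """Hypothetical single-head window attention with shared action tokens.

    points:  (N, D) point features, assumed to be serialized so that
             consecutive rows are spatially close.
    actions: (A, D) action tokens.
    Returns updated (points, actions) with the same shapes.
    """
    n, d = points.shape
    assert n % window_size == 0, "sketch assumes N is divisible by window_size"
    n_windows = n // window_size

    # Partition points into windows: (W, S, D).
    windows = points.reshape(n_windows, window_size, d)
    # Give every window its own copy of the action tokens: (W, A, D).
    act = np.broadcast_to(actions, (n_windows,) + actions.shape)
    tokens = np.concatenate([windows, act], axis=1)  # (W, S + A, D)

    # Self-attention inside each window only. Identity projections stand in
    # for learned ones. Cost is W * (S + A)^2, i.e. linear in W.
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    out = softmax(scores, axis=-1) @ tokens

    new_points = out[:, :window_size].reshape(n, d)
    # Assumption: merge the per-window action copies by averaging, so the
    # action tokens carry information from all windows.
    new_actions = out[:, window_size:].mean(axis=0)
    return new_points, new_actions


# Usage: refine the same action tokens against two hypothetical encoder
# stages whose features are assumed to share one feature width.
rng = np.random.default_rng(0)
actions = rng.normal(size=(4, 16))
for n_points in (256, 64):
    feats = rng.normal(size=(n_points, 16))
    _, actions = bottleneck_window_attention(feats, actions, window_size=32)
```

No attention matrix in this sketch is ever larger than (S + A) × (S + A), which is where the linear scaling in the number of windows comes from. Information crosses window boundaries only through the shared action tokens.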
The paper also provides a systematic comparison of where to inject 3D information in VLAs—backbone-level (monolithic) vs. action-expert-level (dual-system)—which is a useful architectural study for the community.
The experimental design is generally thorough. The paper evaluates on two complementary benchmarks (LIBERO for short-horizon delta actions; RLBench for long-horizon keypoint actions), covers both regression and classification action heads, and includes real-robot experiments on two platforms (SO-100 and UR5).
The work has moderate-to-high practical relevance. The improvement on RLBench-10Tasks over EO1 (73.2 → 82.3, about 9 percentage points and reported as 10%) is substantial, particularly given the challenging nature of the benchmark.
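The "10%" figure can be read either as an absolute or as a relative gain. A short check using only the two success rates quoted above (73.2 and 82.3) shows how the two readings differ. The variable names are illustrative.

```python
baseline, pointact = 73.2, 82.3  # RLBench-10Tasks success rates (%) quoted above

absolute_gain = pointact - baseline              # in percentage points
relative_gain = 100 * absolute_gain / baseline   # in percent of the baseline

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.1f}%")
```

The absolute gain is about 9.1 percentage points and the relative gain about 12.4%, so the headline "10%" is closest to a rounded absolute difference.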
This work is highly timely. VLAs are experiencing rapid development (π0, GR00T, OpenVLA, EO1), and the 2D-to-3D gap is widely acknowledged as a bottleneck. The concurrent emergence of strong 3D pretrained models (PTv3, etc.) creates a natural opportunity for this type of integration. The paper positions itself well within the current discourse on how to augment VLAs with spatial reasoning without disrupting pretrained knowledge.
PointACT makes a solid engineering and empirical contribution to the active area of 3D-aware VLA design. The multi-scale point-action interaction mechanism is well-motivated, efficiently designed, and convincingly validated through ablations. The systematic comparison of 3D integration strategies provides useful architectural guidance. However, the novelty is primarily in the integration design rather than in fundamentally new algorithmic concepts. The real-world validation, while present, is limited in scale. The work represents a meaningful incremental advance that could influence how future VLA systems incorporate 3D geometry.
Generated May 21, 2026
Paper 2 introduces a fundamental shift in embodied trajectory generation by combining flow matching with compositional motion primitives directly in physical space. This addresses the sample inefficiency of monolithic generators and offers broad applicability across various embodied AI domains (manipulators, mobile robots). While Paper 1 provides a strong architectural improvement for VLA models using 3D data, Paper 2's methodological innovation in structured generative modeling presents a more paradigm-shifting approach with higher potential for widespread impact across the field.
Paper 2 addresses a critical bottleneck in the rapid deployment of humanoid robots: cross-embodiment transfer. By enabling the reuse of whole-body tracking models with only 1% of the original compute and data, it offers a highly scalable and cost-effective paradigm. While Paper 1 provides a strong methodological improvement for 3D manipulation, Paper 2's potential to dramatically accelerate the adoption and development of diverse humanoid platforms gives it a higher potential for broad scientific and industry impact.
PointACT addresses the fundamental limitation of 2D representations in VLA models by integrating hierarchical 3D point cloud representations, which is highly relevant to the rapidly growing VLA/foundation model community. It demonstrates strong empirical gains (10%+ on RLBench) and offers broadly applicable insights about coupling 3D geometry with 2D semantics. While Paper 1 makes a solid contribution to humanoid imitation learning with its direct dynamic retargeting approach, Paper 2 targets a larger and more active research area (general-purpose robotic manipulation via foundation models), has broader applicability across manipulation tasks, and its architectural innovations are more likely to influence subsequent work in the VLA space.
Paper 2 addresses a fundamental limitation in current VLA models—lack of 3D spatial reasoning—by directly integrating multi-scale point cloud representations into the action decoding process. While Paper 1 offers an innovative human-robot interaction approach using gestures, Paper 2's focus on fine-grained geometric grounding is more broadly applicable to core robotic manipulation challenges. Furthermore, Paper 2 demonstrates strong methodological rigor with systematic evaluations on established benchmarks (RLBench and LIBERO), making its architectural contributions highly likely to influence future foundational models in robotics.
SOMA addresses a more fundamental and underexplored limitation of VLA models—operating when task-relevant objects are out of the camera's field of view. This is a highly practical real-world constraint that most existing work ignores. The spatial memory framework introduces a novel architectural concept (persistent memory with construction, refinement, and retrieval) validated on real-world tasks including dual-arm scenarios. PointACT, while solid, addresses the more incremental problem of integrating 3D point clouds into VLAs, which has been explored in various forms. SOMA's problem framing is more novel and has broader implications for deploying robots in realistic, partially observable environments.
Paper 1 offers higher potential scientific impact because it addresses a critical, universal bottleneck in robotics research: engineering fragmentation. By providing an LLM-driven, plug-and-play harness for cross-validating policies, simulators, and hardware, NAUTILUS functions as foundational infrastructure. While Paper 2 presents a strong architectural advancement for 3D-aware VLA models, Paper 1's tool could accelerate the entire field's workflow, similar to how unified frameworks revolutionized deep learning. Foundational tools that lower barriers to entry and standardize evaluation typically achieve broader, cross-cutting impact than specific algorithmic improvements.
PointACT addresses a fundamental limitation of VLA models by integrating 3D point cloud representations into action decoding, with broad implications for the rapidly growing VLA field. Its systematic evaluation on standard benchmarks (LIBERO, RLBench) with 10% improvements over SOTA, comprehensive ablations, and demonstration that hierarchical 3D geometry coupling matters provide foundational insights applicable across many robotic manipulation settings. Paper 2, while practically valuable with real-world dexterous manipulation results, addresses a narrower problem with a more specialized retrieve-align-execute pipeline that may have less generalizable impact across the robotics community.
PointACT addresses a fundamental limitation in VLA models by integrating 3D point cloud representations into action decoding, demonstrating significant improvements (10%+ success rate gains) on established benchmarks (LIBERO, RLBench). It tackles a core challenge in robotic manipulation with broad applicability. Q-SpiRL, while novel in combining quantum computing with spiking networks for RL, is evaluated only on simple grid-world environments, limiting real-world impact. The quantum advantage remains unclear, and quantum hardware constraints limit near-term practical deployment. PointACT's contributions are more immediately impactful for the robotics community.
Paper 1 addresses a major limitation in general-purpose robotic manipulation by integrating 3D spatial awareness into Vision-Language-Action (VLA) models. Given the rapid growth and broad applicability of foundation models in robotics, this approach has significantly higher potential for widespread impact across the field compared to Paper 2, which focuses on a highly specialized hardware paradigm (inflatable truss robots).
BlockVLA addresses a fundamental efficiency bottleneck in VLA deployment (inference latency) with a novel architectural bridge between autoregressive and diffusion paradigms. The 3.3x inference acceleration and faster training convergence have broad implications for real-time robotic control. While PointACT's 3D-aware approach is valuable, BlockVLA's contribution is more foundational—it introduces a new computational paradigm (block diffusion for VLAs) that could be combined with various representation improvements including 3D awareness. The efficiency gains are critical for practical deployment, giving it broader potential impact.