Alignment between video generators and robot manipulators.
We identify three dimensions of alignment between video generators and robot manipulators. First, video generators handling novel text and image conditions face a situation that naturally mirrors robot manipulators dealing with unseen instructions and unseen observations. Second, the understanding of physical dynamics learned by video generators is a fundamental capability that any high-performing robot manipulator also needs in order to reason about the physical consequences of its actions. Third, video generators can predict future world states by following given instructions, which inherently reflects a planning capability that is likewise crucial for robotic manipulation models to anticipate and organize their interactions with the physical environment. Motivated by these observations, we aim to explore the following question: "Can large video generators be seamlessly adapted into generalizable robotic manipulators?"
VideoVLA model architecture.
Our core idea is to leverage the cognitive information extracted by powerful VLMs to guide the action prediction of a specialized action module. VideoVLA comprises three componentized modules.
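As a concrete illustration of such a componentized design, the sketch below wires a cognition backbone to a dedicated action head. It is a minimal sketch only: the module names, feature dimensions, and the simple MLP action head are placeholder assumptions and do not reproduce the actual VideoVLA modules.

```python
import torch
import torch.nn as nn

class CognitionBackbone(nn.Module):
    """Stand-in for a large pretrained model that fuses an observation feature
    and an instruction feature into a single cognition feature."""
    def __init__(self, in_dim: int = 768, feat_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, feat_dim)

    def forward(self, obs_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(obs_feat + text_feat)

class ActionModule(nn.Module):
    """Specialized action head mapping the cognition feature to a chunk of
    low-level actions (`horizon` steps of `action_dim`-dimensional commands)."""
    def __init__(self, feat_dim: int = 512, horizon: int = 16, action_dim: int = 7):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, horizon * action_dim),
        )

    def forward(self, cog: torch.Tensor) -> torch.Tensor:
        return self.head(cog).view(-1, self.horizon, self.action_dim)

class ComponentizedVLA(nn.Module):
    """Cognition backbone guiding a dedicated action module."""
    def __init__(self):
        super().__init__()
        self.backbone = CognitionBackbone()
        self.action_module = ActionModule()

    def forward(self, obs_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        return self.action_module(self.backbone(obs_feat, text_feat))

model = ComponentizedVLA()
actions = model(torch.randn(2, 768), torch.randn(2, 768))  # shape: (2, 16, 7)
```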
VideoVLA demonstrates strong performance on in-domain tasks and further exhibits robust generalization, including the ability to perform new skills transferred from other embodiments and to manipulate previously unseen objects.
We evaluate VideoVLA in the SIMPLER simulation environment to test its performance on in-domain tasks. SIMPLER offers two evaluation protocols, Visual Matching (VM) and Variant Aggregation (VA), to assess models on the Google robot and the WidowX robot.
In-domain evaluation of VideoVLA and prior VLA models using the WidowX robot and Google robot within the SIMPLER simulation environment. All models are trained on the OXE dataset.
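The evaluation above can be reproduced with a rollout loop of roughly the following shape, assuming SimplerEnv's gym-style interface; the task name, the success flag read from `info`, and the `policy(obs, instruction)` signature are our own assumptions rather than a prescribed evaluation script.

```python
import simpler_env  # SIMPLER benchmark package (gym-style API)

def evaluate(policy, task_name: str = "google_robot_pick_coke_can",
             num_episodes: int = 50, max_steps: int = 200) -> float:
    """Roll out `policy` in one SIMPLER task and return its success rate."""
    env = simpler_env.make(task_name)
    successes = 0
    for _ in range(num_episodes):
        obs, _ = env.reset()
        instruction = env.get_language_instruction()
        for _ in range(max_steps):
            action = policy(obs, instruction)  # model maps observation + text to an action
            obs, reward, done, truncated, info = env.step(action)
            if done or truncated:
                break
        # ManiSkill2-style success flag; the exact key is an assumption here.
        successes += int(info.get("success", False))
    return successes / num_episodes
```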
We select objects from other 3D asset datasets, including YCB and GSO, that are not present in the Google robot's training data and import them into the SIMPLER environment. We evaluate VideoVLA on the "Pick Up" skill using the Google robot across 10 novel objects.
Evaluation of generalization to novel objects using the Google robot within the SIMPLER environment.
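The object selection described above amounts to a simple filter over candidate assets. In the sketch below, the asset names and the training-object list are placeholders, not the actual 10-object evaluation set.

```python
# Candidate assets from external 3D datasets (placeholder names).
CANDIDATE_ASSETS = {
    "banana": "YCB", "mug": "YCB", "rubiks_cube": "GSO", "toy_bus": "GSO",
}
# Objects already present in the Google robot's training data (placeholder list).
TRAIN_OBJECTS = {"coke_can", "apple", "sponge"}

# Keep only objects never seen during training; these are imported into SIMPLER
# and evaluated with the "Pick Up" skill.
novel_objects = sorted(name for name in CANDIDATE_ASSETS if name not in TRAIN_OBJECTS)
print(novel_objects)
```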
VideoVLA is trained on the OXE dataset, which includes a diverse set of embodiments, each potentially associated with a distinct, non-overlapping set of skills. To assess skill generalization, we evaluate the model’s ability to transfer skills from the WidowX robot to the Google robot, meaning that these skills are included in the WidowX training data but excluded from the Google robot’s training set.
Evaluation of cross-embodiment skill transfer using the Google robot within the SIMPLER environment. The new skills are transferred from the WidowX robot: they are present in the WidowX robot's training data but absent from the Google robot's training set. {L,R,U,B} denotes {Left, Right, Upper, Bottom}.
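The held-out skills in this protocol are simply the set difference between the two embodiments' training skills, as the sketch below illustrates with made-up skill names (not the exact OXE inventory).

```python
# Illustrative per-embodiment skill inventories (not the actual OXE statistics).
widowx_skills = {"pick", "put_on", "put_in", "stack"}
google_robot_skills = {"pick", "move_near", "open_drawer", "close_drawer"}

# Skills present in WidowX training data but absent from the Google robot's
# training set: the model must transfer these across embodiments at test time.
transfer_skills = widowx_skills - google_robot_skills
print(sorted(transfer_skills))  # e.g. ['put_in', 'put_on', 'stack']
```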
We evaluate VideoVLA with a Realman robot, which is equipped with a 7-DoF arm and a gripper, on real-world tasks such as picking, stacking, and placing objects. All models are fine-tuned on the dataset we collected with the Realman robot.
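A closed-loop deployment of this kind typically alternates between grabbing a camera frame, querying the model, and executing the predicted action chunk. The sketch below passes the camera and robot handles in as arguments because the Realman SDK calls shown are placeholders, not a documented API.

```python
import time

def run_episode(model, camera, robot, instruction: str,
                control_hz: float = 5.0, max_steps: int = 150) -> None:
    """Closed-loop rollout sketch; `camera` and `robot` are placeholder interfaces."""
    period = 1.0 / control_hz
    for _ in range(max_steps):
        rgb = camera.read()                        # current RGB observation
        actions = model.predict(rgb, instruction)  # predicted chunk of arm + gripper actions
        for a in actions:
            robot.apply_delta_pose(a[:6])          # end-effector delta (placeholder call)
            robot.set_gripper(a[6])                # gripper command (placeholder call)
            time.sleep(period)
```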
For real-world in-domain evaluation, we assess performance on three tasks.
To increase task difficulty and test robustness, we introduce novel distractor objects into the scene.
Real-world in-domain evaluation using the Realman robot. All models are pre-trained on the OXE dataset and subsequently fine-tuned on our collected dataset.
Using the Realman robot, we perform the task: "Pick up the [Novel Object] and place it onto the [Color] plate", where each [Novel Object] is chosen from a set of novel objects not seen during fine-tuning.
Evaluation of real-world generalization to novel objects using the Realman robot.
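For reference, the evaluation instructions can be enumerated directly from the template above; the object and color lists in the sketch are placeholders rather than the actual test set.

```python
# Placeholder object and plate-color lists for illustration only.
novel_objects = ["toy dinosaur", "rubik's cube", "hand sanitizer"]
plate_colors = ["red", "green"]

instructions = [
    f"Pick up the {obj} and place it onto the {color} plate"
    for obj in novel_objects
    for color in plate_colors
]
print(len(instructions), instructions[0])
```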
In this experiment, we train our model and all baseline models on a combined dataset consisting of the WidowX Robot dataset and our own collected dataset. To evaluate skill transfer, we focus on skills that are observed by the WidowX robot during training but never demonstrated by the Realman robot.
Evaluation of real-world cross-embodiment skill transfer: our Realman robot performs novel skills learned only by the WidowX robot.
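Training on the combined dataset amounts to sampling from the concatenation of the two sources, optionally reweighted. The sketch below uses random tensors as stand-ins for the WidowX and Realman trajectories, since the actual data loaders are not shown here.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Random stand-ins for (observation feature, action) pairs from each source.
widowx_data = TensorDataset(torch.randn(1000, 768), torch.randn(1000, 7))
realman_data = TensorDataset(torch.randn(200, 768), torch.randn(200, 7))

# Plain concatenation; a weighted sampler could rebalance the embodiments
# if one dataset is much larger than the other.
combined = ConcatDataset([widowx_data, realman_data])
loader = DataLoader(combined, batch_size=64, shuffle=True)

for obs_feat, action in loader:
    pass  # one optimization step of the model would go here
```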
@inproceedings{videovla,
  title={VideoVLA: Video Generators Can Be Generalizable Robot Manipulators},
  author={Yichao Shen and Fangyun Wei and Zhiying Du and Yaobo Liang and Yan Lu and Jiaolong Yang and Nanning Zheng and Baining Guo},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)},
  year={2025},
  url={https://openreview.net/forum?id=UPHlqbZFZB}
}