VideoVLA:
Video Generators Can Be Generalizable Robot Manipulators

NeurIPS 2025

Yichao Shen1,2,†, Fangyun Wei2,‡, Zhiying Du3,†, Yaobo Liang2, Yan Lu2,
Jiaolong Yang2,‡, Nanning Zheng1,‡, Baining Guo1
†Interns at Microsoft Research   ‡Corresponding Authors
1IAIR, Xi'an Jiaotong University  2Microsoft Research Asia  3Fudan University 

We present VideoVLA, a simple approach that explores the potential of directly transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments’ skills and handling novel objects. This dual-prediction strategy—forecasting both actions and their visual consequences—explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.

Motivation

Alignment between video generator and robot manipulator.

We identify three dimensions of alignment between video generators and robot manipulators. First, a video generator handling novel text and image conditions mirrors a robot manipulator handling unseen instructions and unseen observations. Second, the understanding of physical dynamics that video generators learn is a fundamental capability any high-performing manipulator needs in order to reason about the physical consequences of its actions. Third, video generators can predict future world states by following given instructions, which reflects a planning capability that manipulation models likewise need in order to anticipate and organize their interactions with the physical environment. Motivated by these observations, we aim to explore the following question: "Can large video generators be seamlessly adapted into generalizable robotic manipulators?"

The VideoVLA Model

VideoVLA model architecture.

Our core idea is to directly adapt a large pre-trained video generator into a vision-language-action (VLA) manipulator that jointly forecasts future actions and the visual outcomes of executing them. VideoVLA has three components (see the sketch after the list):

  • Pretrained Video Generator Backbone: Our model's DiT backbone is built upon CogVideoX, one of the most powerful video generation models. Using a pre-trained video generator enables the system to interpret language instructions and generate plausible imagined futures.
  • Dual-Prediction Strategy: VideoVLA jointly predicts future actions and generates the corresponding future visual content that would result from executing these actions in the current environment, supervised by a DDPM diffusion loss. This dual-prediction strategy fosters a strong correlation between predicted actions and their visual consequences.
  • Unified Future Modeling: All modalities are projected into a shared token representation with a common embedding dimension, leveraging the advantages of a unified DiT architecture.
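To make these components concrete, below is a minimal, self-contained PyTorch sketch of the dual-prediction idea. Everything in it is illustrative and assumed: the module names (e.g., VideoVLASketch), the toy transformer standing in for the DiT backbone, all dimensions, and the loss wiring. It is not the authors' implementation and omits the CogVideoX backbone, the video VAE, and the text encoder.

import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256  # shared embedding dimension for all modalities (illustrative)

class VideoVLASketch(nn.Module):
    """Toy stand-in for the multi-modal DiT: text, video, and action tokens
    share one embedding space and one backbone (Unified Future Modeling)."""
    def __init__(self, vocab=1000, action_dim=7, patch_dim=3 * 16 * 16, depth=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, D)
        self.video_proj = nn.Linear(patch_dim, D)
        self.action_proj = nn.Linear(action_dim, D)
        self.time_embed = nn.Sequential(nn.Linear(1, D), nn.SiLU(), nn.Linear(D, D))
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.video_head = nn.Linear(D, patch_dim)    # denoises future video tokens
        self.action_head = nn.Linear(D, action_dim)  # denoises future action tokens

    def forward(self, text_ids, noisy_video, noisy_actions, t):
        # Dual prediction: one pass denoises the imagined future video and the
        # future action chunk jointly, conditioned on the instruction.
        tokens = torch.cat([
            self.text_embed(text_ids),
            self.video_proj(noisy_video),
            self.action_proj(noisy_actions),
        ], dim=1) + self.time_embed(t.float().view(-1, 1, 1))
        h = self.backbone(tokens)
        n_txt, n_vid = text_ids.shape[1], noisy_video.shape[1]
        return self.video_head(h[:, n_txt:n_txt + n_vid]), self.action_head(h[:, n_txt + n_vid:])

def ddpm_loss(model, text_ids, video_tokens, actions, alphas_cumprod):
    # Standard epsilon-prediction DDPM objective applied to both modalities.
    B = video_tokens.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,))
    a = alphas_cumprod[t].view(B, 1, 1)
    eps_v, eps_a = torch.randn_like(video_tokens), torch.randn_like(actions)
    pred_v, pred_a = model(text_ids,
                           a.sqrt() * video_tokens + (1 - a).sqrt() * eps_v,
                           a.sqrt() * actions + (1 - a).sqrt() * eps_a, t)
    return F.mse_loss(pred_v, eps_v) + F.mse_loss(pred_a, eps_a)

# Toy usage: 2 samples, 12 text tokens, 32 video patch tokens, a 16-step action chunk.
model = VideoVLASketch()
alphas = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
loss = ddpm_loss(model,
                 torch.randint(0, 1000, (2, 12)),
                 torch.randn(2, 32, 3 * 16 * 16),
                 torch.randn(2, 16, 7),
                 alphas)
loss.backward()

A single MSE over the concatenated outputs is what ties the two predictions together: the backbone must produce actions that are consistent with the imagined future frames it denoises in the same pass.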

Experimental Results

Result Summary

On in-domain tasks, VideoVLA demonstrates strong performance; it also exhibits robust generalization, including the ability to emulate new skills transferred from other embodiments and to manipulate previously unseen objects.

In-domain
Generalization to Novel Objects
Generalization to New Skills Transfer

Evaluation in Simulation

In-domain Evaluation

We evaluate VideoVLA on in-domain tasks in the SIMPLER simulation environment. SIMPLER offers two evaluation protocols, Visual Matching (VM) and Variant Aggregation (VA), for assessing models with the Google robot and the WidowX robot.
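As a rough illustration of how such protocol-level success rates are aggregated, the sketch below averages per-episode success over a set of scene variants. The make_env, policy, and env.step interfaces are placeholders for illustration, not the actual SIMPLER API.

from statistics import mean

def rollout_success(env, policy, max_steps=200):
    # Roll out one episode; the policy maps (observation, instruction) to an action.
    obs, instruction = env.reset()
    for _ in range(max_steps):
        obs, done, success = env.step(policy(obs, instruction))
        if done:
            return float(success)
    return 0.0

def success_rate(make_env, policy, variants, episodes_per_variant=25):
    # Visual Matching evaluates on a single carefully matched scene variant;
    # Variant Aggregation averages over many background/texture/lighting variants.
    return mean(
        mean(rollout_success(make_env(v), policy) for _ in range(episodes_per_variant))
        for v in variants
    )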

In-domain evaluation of VideoVLA and prior VLA models using the WidowX robot and Google robot within the SIMPLER simulation environment. All models are trained on the OXE dataset.

Visualization of Predicted Action and Video

"Predicted Action" shows the visual results of the actions predicted by the VideoVLA during task completion, while "Predicted Video" refers to the model's corresponding video prediction.

Google Robot

Pick coke can.
Move 7up can near apple.
Close middle drawer.
Move orange near pepsi can.

WidowX Robot

Put carrot on plate.
Put eggplant into yellow basket.
Put the spoon on the towel.
Stack the green block on the yellow block.

Novel Objects Evaluation

We select objects from other 3D asset datasets, including YCB and GSO, that are not present in the Google robot's training data and import them into the SIMPLER environment. We evaluate VideoVLA on the "Pick Up" skill using the Google robot across 10 novel objects.

Evaluation of generalization to novel objects using the Google robot under the SIMPLER environment.

Visualization of Predicted Action and Video

"Predicted Action" shows the visual results of the actions predicted by the VideoVLA during task completion, while "Predicted Video" refers to the model's corresponding video prediction.
Pick up eggplant.
Pick up plum.
Pick up strawberry.
Pick up wrench.

New Skills Transfer Evaluation

VideoVLA is trained on the OXE dataset, which includes a diverse set of embodiments, each potentially associated with a distinct, non-overlapping set of skills. To assess skill generalization, we evaluate the model’s ability to transfer skills from the WidowX robot to the Google robot, meaning that these skills are included in the WidowX training data but excluded from the Google robot’s training set.
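As a small illustration of how the transfer set can be constructed, the sketch below takes the set difference between the skills seen by the two embodiments. The skill_of extractor is hypothetical; OXE episodes carry free-form language instructions rather than canonical skill labels.

# Illustrative only: "new" skills are those present in the WidowX data
# but absent from the Google robot data.
def transferable_skills(widowx_instructions, google_instructions, skill_of):
    widowx_skills = {skill_of(i) for i in widowx_instructions}
    google_skills = {skill_of(i) for i in google_instructions}
    return widowx_skills - google_skills

# Toy skill extractor keyed on the leading verb.
skills = transferable_skills(
    ["put carrot on plate", "stack the green block on the yellow block"],
    ["pick coke can", "close middle drawer"],
    skill_of=lambda s: s.split()[0],
)
print(skills)  # {'put', 'stack'}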

Evaluation of generalization via new-skill transfer using the Google robot within the SIMPLER environment. The new skills are transferred from the WidowX robot: they are present in the WidowX robot's training data but absent from the Google robot's training set. {L, R, U, B} denotes {Left, Right, Upper, Bottom}.

Visualization of Predicted Action and Video

"Predicted Action" shows the visual results of the actions predicted by the VideoVLA during task completion, while "Predicted Video" refers to the model's corresponding video prediction.
Stack the green block on the yellow block.(from WidowX)
Take out of apple.(from WidowX)
Put carrot on plate.(from WidowX)
Put the spoon on the towel.(from WidowX)

Evaluation in Real World

We evaluate VideoVLA with a Realman robot, which is equipped with a 7-DoF arm and a gripper, to perform real-world tasks such as picking, stacking, and placing objects. All models are finetuned on our collected dataset using the Realman robot.

In-Domain Evaluation.

For real-world in-domain evaluation, we assess performance on three templated tasks (enumerated in the sketch after the list):

  1. Pick up the [Object] and place it onto the [Color] plate, where Object ∈ {Banana, Lemon, Avocado}, and Color ∈ {White, Blue, Yellow};
  2. Stack the [Color] [Object] into the [Color] [Object], where Object ∈ {Cup, Bowl} and Color ∈ {Pink, White, Blue, Yellow};
  3. Place the [Color] block onto the [Color] block, where Color ∈ {Red, Orange, Blue, Green, Yellow}.
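The three templates expand into a fixed pool of language instructions. The sketch below enumerates every combination allowed by the object and color sets above; the variable names are illustrative, and the actual evaluated subset may be smaller than this full expansion.

from itertools import product

pick_objects = ["Banana", "Lemon", "Avocado"]
plate_colors = ["White", "Blue", "Yellow"]
containers = ["Cup", "Bowl"]
container_colors = ["Pink", "White", "Blue", "Yellow"]
block_colors = ["Red", "Orange", "Blue", "Green", "Yellow"]

instructions = []
# Task 1: pick-and-place onto a colored plate.
instructions += [f"Pick up the {o} and place it onto the {c} plate"
                 for o, c in product(pick_objects, plate_colors)]
# Task 2: stack a colored cup/bowl into another (distinct) colored cup/bowl.
instructions += [f"Stack the {c1} {o1} into the {c2} {o2}"
                 for (c1, o1), (c2, o2) in product(product(container_colors, containers), repeat=2)
                 if (c1, o1) != (c2, o2)]
# Task 3: place one colored block onto a differently colored block.
instructions += [f"Place the {c1} block onto the {c2} block"
                 for c1, c2 in product(block_colors, repeat=2) if c1 != c2]

print(len(instructions), instructions[:2])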

To increase task difficulty and test robustness, we introduce novel distractor objects into the scene.

Real-world in-domain evaluation using the Realman robot. All models are pre-trained on the OXE dataset and subsequently fine-tuned on our collected dataset.

Novel Objects Evaluation.

Using the Realman robot, we perform the task “Pick up the [Novel Object] and place it onto the [Color] plate”, where each Novel Object is drawn from a set of objects not seen during fine-tuning.

Evaluation of real-world generalization to novel objects using the Realman robot.

New Skills Transfer Evaluation.

In this experiment, we train our model and all baseline models on a combined dataset consisting of the WidowX robot dataset and our own collected dataset. To evaluate skill transfer, we focus on skills that appear in the WidowX robot's training data but are never demonstrated by the Realman robot.

Evaluating real-world cross-embodiment skill transfer: our Realman robot performs novel skills learned only by the WidowX robot.

Visualization of Predicted Action and Video

"Predicted Action" shows the visual results of the actions predicted by the VideoVLA during task completion, while "Predicted Video" refers to the model's corresponding video prediction.
Pick up the white cup.
Put the blue ball into pink bowl.
Move the blue block near the red block.
Take the blue block out of plate.

BibTeX

@inproceedings{
    videovla,
    title={VideoVLA: Video Generators Can Be Generalizable Robot Manipulators},
    author={Yichao Shen and Fangyun Wei and Zhiying Du and Yaobo Liang and Yan Lu and Jiaolong Yang and Nanning Zheng and Baining Guo},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)},
    year={2025},
    url={https://openreview.net/forum?id=UPHlqbZFZB}
}