Recent advances in learning video generation models from large-scale video data demonstrate significant potential for understanding complex physical dynamics.
This suggests that diverse robot trajectory data could be leveraged to develop a unified, dynamics-aware model that enhances robot manipulation.
However, given the relatively small amount of available robot data, directly fitting this data without modeling the relationship between visual observations and actions can lead to suboptimal data utilization.
To this end, we propose VidMan (Video Diffusion for Robot Manipulation), a novel framework that employs a two-stage training mechanism inspired by dual-process theory from neuroscience to enhance stability and improve data utilization efficiency.
Specifically, in the first stage, VidMan is pre-trained on the Open X-Embodiment dataset (OXE) to predict future visual trajectories in a video denoising diffusion manner, enabling the model to develop long-horizon awareness of the environment's dynamics.
In the second stage, a flexible yet effective layer-wise self-attention adapter is introduced to transform VidMan into an efficient inverse dynamics model that predicts actions modulated by the implicit dynamics knowledge through parameter sharing.
Our VidMan framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
These results provide compelling evidence that world models can significantly enhance the precision of robot action prediction.
VidMan's two-stage training paradigm mirrors dual-process theory: the first stage (like System 2) pre-trains the model to understand environment dynamics through video diffusion, forming a foundation for accurate action prediction, while the second stage (like System 1) is adapted from the first to leverage the learned dynamics knowledge for rapid, low-level action inference.
VidMan employs a dual-stage training strategy. In the first stage, the Dynamics-aware Visionary Stage, the model learns to forecast potential future trajectories from historical observations, leveraging the multi-frame prediction capability of the video diffusion model; through this stage, it is optimized to understand the dynamics of the environment. In the second stage, the Dynamics-modulated Action Stage, we introduce a lightweight layer-wise adapter that integrates the visionary predictive stage with fast, adaptive action prediction. This design decouples world knowledge and embodiment knowledge into distinct processes while keeping them tightly coupled through shared parameters.
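To make the two-stage design concrete, the sketch below shows one way a shared backbone and a layer-wise adapter could be wired together in PyTorch. It is a minimal illustration, not the official implementation: the module names (VideoDiffusionBackbone, LayerwiseAdapter), the tensor shapes, and the simplified diffusion objective (a single fixed noise scale, no timestep conditioning) are assumptions made for clarity.

```python
import torch
import torch.nn as nn


class VideoDiffusionBackbone(nn.Module):
    """Shared transformer trunk reused by both stages (parameter sharing)."""

    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(depth)]
        )

    def forward(self, tokens, return_hidden=False):
        hidden = []
        for blk in self.blocks:
            tokens = blk(tokens)
            hidden.append(tokens)
        return (tokens, hidden) if return_hidden else tokens


class LayerwiseAdapter(nn.Module):
    """Stage-2 adapter: attends over per-layer features to predict an action."""

    def __init__(self, dim=256, action_dim=7, heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, action_dim)

    def forward(self, hidden_states):
        kv = torch.cat(hidden_states, dim=1)        # (B, depth * T, D)
        q = self.query.expand(kv.size(0), -1, -1)   # (B, 1, D)
        out, _ = self.attn(q, kv, kv)
        return self.head(out.squeeze(1))            # (B, action_dim)


def stage1_loss(backbone, future_tokens, noise_scale=0.1):
    """Dynamics-aware Visionary Stage: denoise corrupted future-frame tokens."""
    noise = torch.randn_like(future_tokens)
    pred_noise = backbone(future_tokens + noise_scale * noise)
    return nn.functional.mse_loss(pred_noise, noise)


def stage2_loss(backbone, adapter, obs_tokens, gt_action):
    """Dynamics-modulated Action Stage: predict actions from shared features."""
    _, hidden = backbone(obs_tokens, return_hidden=True)
    return nn.functional.mse_loss(adapter(hidden), gt_action)
```

The key point the sketch conveys is that the second stage reuses the very backbone weights trained in the first stage, so the lightweight action head is modulated by the dynamics knowledge acquired during video pre-training.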
We choose CALVIN, an open-source benchmark for learning long-horizon, language-conditioned tasks,
as our experimental platform, and use the corresponding data as demonstrations for imitation learning.
CALVIN provides 24 hours of unstructured tele-operated play data, with 1%
annotated with language descriptions. Each instruction chain consists of five sequential language
instructions to execute. Evaluation follows a zero-shot generalization setup: models are trained on
environments A, B, and C and tested on D. Performance metrics include the success rate at each step of
the chain and the average number of sequential tasks completed. Results and example rollouts are shown below,
followed by a sketch of the metric computation.
Example rollout instructions: "Lift the blue block in the slider.", "Place the red block in the drawer.", "Open the drawer.", "Turn off the LED."
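The following is a hypothetical sketch of how the two CALVIN metrics described above can be computed from rollout outcomes: the success rate at each position of the five-instruction chain and the average number of consecutively completed tasks. The data layout (rollouts as per-chain lists of success flags) is an assumption made for illustration.

```python
from typing import List, Tuple


def calvin_metrics(rollouts: List[List[bool]], chain_len: int = 5) -> Tuple[List[float], float]:
    """rollouts[i] holds the per-task success flags of the i-th instruction chain, in order."""
    n = len(rollouts)
    success_at = [0] * chain_len  # how many chains reached step k
    total_completed = 0
    for chain in rollouts:
        completed = 0
        for ok in chain[:chain_len]:
            if not ok:          # a chain stops at its first failed instruction
                break
            completed += 1
        total_completed += completed
        for k in range(completed):
            success_at[k] += 1
    success_rates = [s / n for s in success_at]
    avg_len = total_completed / n  # average number of tasks completed in sequence
    return success_rates, avg_len


# Example: one chain finishing 3/5 tasks, one finishing all 5.
rates, avg_len = calvin_metrics([[True, True, True, False, False],
                                 [True, True, True, True, True]])
print(rates, avg_len)  # [1.0, 1.0, 1.0, 0.5, 0.5] 4.0
```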
We also report offline metrics, including the average of XYZ accuracy and
Euler angle accuracy (Avg xyz ang) and the MSE of end-to-end action prediction on Bridge, Taco
Play, Cable Routing, and Autolab UR5, all of which are included in OXE. Following
Octo, we use a continuous action space. XYZ accuracy measures the precision of the robot's
predicted 3D position (X, Y, Z coordinates) compared to the ground truth during evaluation,
and Euler angle accuracy measures the precision of the predicted orientation angles (rotations
around the X, Y, and Z axes) compared to the ground truth during offline evaluation. Specifically,
XYZ accuracy indicates whether the predicted XYZ delta lies within 0.5 radians (in direction) and within
50% of the ground-truth norm while the robot is in motion, and Euler angle accuracy indicates whether the
predicted rotation delta lies within 0.5 radians during movement. Additionally, we report the mean squared
error (MSE), which reflects how closely each model's predicted actions match the ground truth.
We present the offline evaluation results and video prediction results on OXE below, preceded by a sketch of how these metrics can be computed.
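The snippet below is an illustrative NumPy sketch of these offline metrics under the thresholds stated above (0.5 rad and 50% of the delta norm). The exact thresholding and motion filtering used in the actual evaluation may differ, and the 7-dimensional action layout [dx, dy, dz, droll, dpitch, dyaw, gripper] is an assumption.

```python
import numpy as np


def _direction_angle(a, b, eps=1e-8):
    """Angle in radians between two batches of 3-D vectors."""
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    )
    return np.arccos(np.clip(cos, -1.0, 1.0))


def offline_metrics(pred, gt, motion_eps=1e-3):
    """pred, gt: (N, 7) arrays laid out as [dx, dy, dz, droll, dpitch, dyaw, gripper]."""
    pred_xyz, gt_xyz = pred[:, :3], gt[:, :3]
    pred_rot, gt_rot = pred[:, 3:6], gt[:, 3:6]

    # XYZ accuracy: delta direction within 0.5 rad and magnitude within 50% of
    # the ground-truth norm, evaluated only while the robot is in motion.
    moving = np.linalg.norm(gt_xyz, axis=-1) > motion_eps
    dir_ok = _direction_angle(pred_xyz, gt_xyz) < 0.5
    norm_ok = np.abs(
        np.linalg.norm(pred_xyz, axis=-1) - np.linalg.norm(gt_xyz, axis=-1)
    ) < 0.5 * np.linalg.norm(gt_xyz, axis=-1)
    xyz_acc = float(np.mean((dir_ok & norm_ok)[moving]))

    # Euler angle accuracy: rotation delta within 0.5 rad during movement.
    rotating = np.linalg.norm(gt_rot, axis=-1) > motion_eps
    ang_acc = float(np.mean((np.linalg.norm(pred_rot - gt_rot, axis=-1) < 0.5)[rotating]))

    # MSE over the full action vector.
    mse = float(np.mean((pred - gt) ** 2))
    return {"xyz_acc": xyz_acc, "ang_acc": ang_acc,
            "avg_xyz_ang": (xyz_acc + ang_acc) / 2, "mse": mse}
```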
Example video prediction instructions: "Move red pepper to above green towel.", "Fold the cloth from top left to bottom right.", "Unfold the cloth from top left to bottom right.", "Put carrot in pot cardboard fence."
We propose VidMan, a novel framework that leverages video diffusion models for robot imitation learning and addresses the limitations of current GPT-style paradigms in real-time applications. By combining a Dynamics-aware Visionary Stage, which develops a deep understanding of environment dynamics through pre-training on the Open X-Embodiment dataset, with a Dynamics-modulated Action Stage that efficiently integrates this knowledge into action prediction, VidMan achieves both high precision and computational efficiency. This two-stage approach ensures robust and rapid action generation, significantly improving performance on benchmarks such as CALVIN and the OXE dataset. In the future, we plan to extend VidMan to perceive additional dimensions of information.
@inproceedings{wenvidman,
title={VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation},
author={Wen, Youpeng and Lin, Junfan and Zhu, Yi and Han, Jianhua and Xu, Hang and Zhao, Shen and Liang, Xiaodan},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems}
}