MotionLCM: The Fastest and Best Motion Generation Model
Recently, we released MotionLCM, a single-step diffusion model that supports real-time motion generation and control! The paper, code, demo, and project homepage are all publicly available.
MotionLCM focuses on the foundational task of lifelike motion generation, aiming to produce reasonable and realistic human body motions. The core challenge for previous diffusion-based works has been efficiency: their inference times are exceptionally long. Inspired by consistency distillation, MotionLCM generates a latent code in a single step and obtains the motion through a decoder. MotionLCM supports inference with 1-4 steps, with almost no difference in quality between 1 and 4 steps, and its efficiency is a significant improvement over diffusion-based models. Below is a comparison of FID and speed. Generating a sequence of roughly 200 frames takes only about 30 ms, which works out to roughly 6,000 frames per second.
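To make the single-step idea concrete, here is a minimal PyTorch sketch of the sampling loop: a consistency model maps a noisy latent (plus the text condition) directly to a clean latent, which a pretrained decoder turns into a motion sequence. This is an illustrative sketch, not our released code; the module names, toy networks, and tensor sizes are all assumed placeholders.

```python
# Minimal sketch of one-/few-step latent consistency sampling (illustrative only).
import torch
import torch.nn as nn

LATENT_DIM, MOTION_DIM, NUM_FRAMES = 256, 263, 196  # assumed sizes

class ConsistencyDenoiser(nn.Module):
    """Stand-in for the distilled latent consistency model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM * 2, 512), nn.SiLU(),
                                 nn.Linear(512, LATENT_DIM))

    def forward(self, noisy_latent, text_emb):
        # Predict the clean latent in a single network evaluation.
        return self.net(torch.cat([noisy_latent, text_emb], dim=-1))

class MotionDecoder(nn.Module):
    """Stand-in for the pretrained decoder (clean latent -> motion frames)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(LATENT_DIM, MOTION_DIM * NUM_FRAMES)

    def forward(self, latent):
        return self.net(latent).view(-1, NUM_FRAMES, MOTION_DIM)

@torch.no_grad()
def sample_motion(denoiser, decoder, text_emb, num_steps=1):
    """1- to 4-step sampling: every step jumps back to a clean latent; the
    optional extra steps re-noise the estimate slightly before refining it."""
    latent = torch.randn(text_emb.shape[0], LATENT_DIM)
    for step in range(num_steps):
        latent = denoiser(latent, text_emb)
        if step < num_steps - 1:
            latent = latent + 0.1 * torch.randn_like(latent)
    return decoder(latent)  # (batch, frames, motion features)

motion = sample_motion(ConsistencyDenoiser(), MotionDecoder(),
                       text_emb=torch.randn(1, LATENT_DIM), num_steps=1)
print(motion.shape)  # torch.Size([1, 196, 263])
```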
Without a doubt, we have achieved a favorable trade-off between speed and generation quality. To push this work further, Wenxun, Jingbo, and I pondered one question: what is the primary application scenario for a real-time generation algorithm? We unanimously agreed to explore the controllability of MotionLCM, because editing and control place the strictest demands on real-time performance: when users determine and edit the output motion from given conditions (such as trajectories) in real time, instant feedback from the algorithm is crucial. Therefore, we integrated a control module, called Motion ControlNet, into the latent-space diffusion to achieve controllable motion generation. Numerically, our control algorithm is approximately 1,000 times faster than the best-performing baseline, with comparable quality.
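The sketch below illustrates the ControlNet-style pattern in latent space: a trainable control branch consumes the control signal (e.g., a trajectory) and adds zero-initialized residuals to a frozen base denoiser, so control is learned without disturbing the pretrained generator. Again, this is only a rough sketch under assumed shapes and a simplified architecture, not the exact Motion ControlNet implementation; it reuses the `ConsistencyDenoiser` stand-in from the sketch above.

```python
# Rough ControlNet-style sketch in latent space (illustrative, not the paper's code).
import torch
import torch.nn as nn

LATENT_DIM, TRAJ_DIM = 256, 3  # assumed sizes; TRAJ_DIM ~ (x, y, z) per frame

class MotionControlNetSketch(nn.Module):
    def __init__(self, base_denoiser):
        super().__init__()
        self.base = base_denoiser.eval()          # frozen base denoiser
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.traj_encoder = nn.Linear(TRAJ_DIM, LATENT_DIM)
        self.control_branch = nn.Sequential(nn.Linear(LATENT_DIM * 2, 512),
                                            nn.SiLU(),
                                            nn.Linear(512, LATENT_DIM))
        self.zero_proj = nn.Linear(LATENT_DIM, LATENT_DIM)
        nn.init.zeros_(self.zero_proj.weight)     # start as a no-op so training
        nn.init.zeros_(self.zero_proj.bias)       # begins from the base behavior

    def forward(self, noisy_latent, text_emb, trajectory):
        # trajectory: (batch, frames, TRAJ_DIM); pool frames into one hint vector.
        hint = self.traj_encoder(trajectory).mean(dim=1)
        residual = self.zero_proj(
            self.control_branch(torch.cat([noisy_latent, hint], dim=-1)))
        # Inject the control residual, then let the frozen base predict the latent.
        return self.base(noisy_latent + residual, text_emb)

# Example (reusing the stand-ins from the previous sketch):
# controlled = MotionControlNetSketch(ConsistencyDenoiser())
# latent = controlled(torch.randn(1, LATENT_DIM), torch.randn(1, LATENT_DIM),
#                     trajectory=torch.randn(1, 196, TRAJ_DIM))
```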
We provide demos of text-to-motion and controllable motion generation in the following video. MotionLCM supports both dense and sparse control signals (video link here).
Additionally, we have provided a Hugging Face interactive demo for everyone to try, supporting the generation of diverse results and different motion durations. However, the platform currently has no GPUs and only shared CPU resources, so you cannot experience the real-time generation speed there. You can download the demo and deploy it locally to experience it firsthand. The demo is here.
This blog was written by Ling-Hao (Evan) Chen. Credit also goes to Wenxun, Jingbo, Jinpeng, Bo, and Yansong.
📜 Citation
@article{motionlcm,
  title={MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model},
  author={Dai, Wenxun and Chen, Ling-Hao and Wang, Jingbo and Liu, Jinpeng and Dai, Bo and Tang, Yansong},
  journal={arXiv preprint arXiv:2404.19759},
  year={2024},
}