backbone:
  - diffusion
domain:
  - multi-modal
frameworks:
  - pytorch
license: cc-by-nc-nd-4.0
metrics:
  - realism
  - video-video similarity
studios:
  - damo/Video-to-Video
tags:
  - video2video generation
  - diffusion model
  - 视频到视频
  - 视频超分辨率
  - 视频生成视频
  - 生成
tasks:
  - video-to-video
widgets:
  - examples:
      - inputs:
          - data: A panda eating bamboo on a rock.
            name: text
          - data: XXX/test.mpt
            name: video_path
        name: 2
        title: Example 1
    inferencespec:
      cpu: 4
      gpu: 1
      gpu_memory: 28000
      memory: 32000
    inputs:
      - name: text, video_path
        title: Input English prompt, video path
        type: str, str
        validator:
          max_words: 75, /
    task: video-to-video

Video-to-Video

MS-Vid2Vid is developed and trained by DAMO Academy. It is primarily used to improve the resolution and spatiotemporal continuity of text-to-video and image-to-video outputs. Its training data consists of a large, curated collection of high-definition videos and images (shortest side > 720 pixels). The model upscales low-resolution 16:9 videos to 1280 × 720 and can be applied to super-resolution of arbitrary low-resolution videos. On this page, we refer to it as MS-Vid2Vid-XL.


Fig.1 Video-to-Video-XL

Introduction

MS-Vid2Vid-XL is built on Stable Diffusion, and its design details follow our in-house VideoComposer; see the VideoComposer technical report for specifics. In the examples below, the left video is the low-resolution input (448 × 256), which shows jittery details and poor temporal consistency. The right video is the high-resolution output (1280 × 720), which is considerably smoother overall and, in many cases, shows a strong ability to correct such defects.

Code example

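from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# VID_PATH: path to your input video (placeholder; replace with a real file)
# TEXT: English text description of the video
VID_PATH = 'path/to/your_video.mp4'
TEXT = 'A panda eating bamboo on a rock.'

pipe = pipeline(task='video-to-video', model='damo/Video-to-Video')
p_input = {
    'video_path': VID_PATH,
    'text': TEXT,
}

# Run the pipeline and read the path of the generated video from the result.
output_video_path = pipe(p_input, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
print(output_video_path)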
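
The widget metadata for this model lists a `max_words: 75` validator for the prompt. The sketch below shows one way to check a prompt against that limit before building the pipeline input. `build_input` and `MAX_PROMPT_WORDS` are hypothetical names introduced for illustration and are not part of ModelScope, and counting words by splitting on whitespace is an assumption about how the limit is measured.

```python
from pathlib import Path

# Assumption: the cap mirrors the `max_words: 75` validator in the model card
# metadata, and words are counted by splitting on whitespace.
MAX_PROMPT_WORDS = 75


def build_input(video_path: str, text: str, check_exists: bool = True) -> dict:
    """Build the input dict for the pipeline call (hypothetical helper)."""
    words = text.split()
    if not words:
        raise ValueError("text must be a non-empty English prompt")
    if len(words) > MAX_PROMPT_WORDS:
        raise ValueError(
            f"prompt has {len(words)} words; the limit is {MAX_PROMPT_WORDS}"
        )
    # Optionally verify that the input video is present on disk.
    if check_exists and not Path(video_path).is_file():
        raise FileNotFoundError(f"input video not found: {video_path}")
    return {"video_path": video_path, "text": text}
```

The returned dict has the same two keys as `p_input` in the example above, so it can be passed to `pipe(...)` in its place.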

Limitations

MS-Vid2Vid-XL may have the following limitations:

  • Distant targets may appear somewhat blurry. This can be resolved or mitigated by providing input text.
  • Computation is expensive: generating 720P video requires a latent space of size 160 × 90, and a single video takes more than 2 minutes to process.
  • Only English input is currently supported, because of the training data.
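
The 160 × 90 latent size is consistent with an 8× spatial downsampling of a 1280 × 720 frame, which is the factor used by the Stable Diffusion VAE. This page does not state the factor, so treat it as an assumption. A small sketch of the arithmetic follows; `latent_size` is a hypothetical helper, and 448 × 256 is the low-resolution example size mentioned in the introduction.

```python
from typing import Tuple


def latent_size(width: int, height: int, factor: int = 8) -> Tuple[int, int]:
    """Spatial size of the latent grid for a frame of the given pixel size.

    Assumption: factor = 8, the downsampling of the Stable Diffusion VAE.
    """
    if width % factor or height % factor:
        raise ValueError("width and height must be multiples of the factor")
    return width // factor, height // factor


out_w, out_h = latent_size(1280, 720)  # 720P output frame
in_w, in_h = latent_size(448, 256)     # low-resolution example frame

# Ratio of latent positions per frame between the two resolutions.
ratio = (out_w * out_h) / (in_w * in_h)
print((out_w, out_h), (in_w, in_h), round(ratio, 2))
# prints: (160, 90) (56, 32) 8.04
```

Under this assumption a 720P frame has about 8 times as many latent positions as a 448 × 256 frame, which is one way to see why generation at this resolution is costly.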

References

@article{videocomposer2023,
  title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
  author={Wang, Xiang* and Yuan, Hangjie* and Zhang, Shiwei* and Chen, Dayou* and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
  journal={arXiv preprint arXiv:2306.02018},
  year={2023}
}

@inproceedings{videofusion2023,
  title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
  author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}

License Agreement

Our code and model weights are available for personal/academic research use only; commercial use is not currently supported.