metadata

backbone:
  - diffusion
domain:
  - multi-modal
frameworks:
  - pytorch
license: cc-by-nc-nd-4.0
metrics:
  - realism
  - video-video similarity
studios:
  - damo/Video-to-Video
tags:
  - video2video generation
  - diffusion model
  - video-to-video
  - video super-resolution
  - video-to-video generation
  - generation
tasks:
  - video-to-video
widgets:
  - examples:
      - inputs:
          - data: A panda eating bamboo on a rock.
            name: text
          - data: XXX/test.mpt
            name: video_path
        name: 2
        title: Example 1
    inferencespec:
      cpu: 4
      gpu: 1
      gpu_memory: 28000
      memory: 32000
    inputs:
      - name: text, video_path
        title: English prompt, video path
        type: str, str
        validator:
          max_words: 75, /
    task: video-to-video

Video-to-Video

MS-Vid2Vid is developed and trained by DAMO Academy. It is primarily used to improve the resolution and spatiotemporal continuity of text-to-video and image-to-video outputs. Its training data consists of a large, curated collection of high-definition videos and images (shortest side > 720). The model upscales low-resolution (16:9) videos to a higher resolution (1280 * 720) and can be used for super-resolution of arbitrary low-resolution videos. On this page we refer to it as MS-Vid2Vid-XL.

Fig.1 Video-to-Video-XL

Introduction

MS-Vid2Vid-XL is designed based on Stable Diffusion, with design details inherited from our in-house VideoComposer; see its technical report for specifics. In the example comparison, the low-resolution input (448 * 256) on the left shows jittering details and poor temporal consistency, while the high-resolution output (1280 * 720) on the right is much smoother overall. In many cases the model shows a strong ability to correct such artifacts.
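The resolutions quoted above (448 * 256 input, 1280 * 720 output) can be checked against the 16:9 figure with a little arithmetic. The sketch below is illustrative only: the helper names and the 2% tolerance are assumptions made for this example and are not part of the model's code.

```python
from fractions import Fraction

TARGET_ASPECT = Fraction(16, 9)   # aspect ratio of the 1280 * 720 output
TARGET_SIZE = (1280, 720)         # output resolution stated on this page


def aspect_ratio(width: int, height: int) -> Fraction:
    """Exact width:height ratio as a reduced fraction."""
    return Fraction(width, height)


def is_roughly_16_9(width: int, height: int, tolerance: float = 0.02) -> bool:
    """True if the ratio is within `tolerance` (relative) of 16:9.

    The 2% default is an arbitrary choice for illustration.
    """
    rel_diff = abs(aspect_ratio(width, height) - TARGET_ASPECT) / TARGET_ASPECT
    return rel_diff <= tolerance


def scale_factors(src: tuple, dst: tuple = TARGET_SIZE) -> tuple:
    """Per-axis upscaling factors (width, height) from src to dst."""
    return dst[0] / src[0], dst[1] / src[1]


if __name__ == "__main__":
    print(aspect_ratio(448, 256))      # 7/4
    print(is_roughly_16_9(448, 256))   # True
    print(scale_factors((448, 256)))
```

The 448 * 256 example reduces to 7:4, which differs from 16:9 by a relative 1/64, so it is close to but not exactly 16:9. The width and height are therefore scaled by slightly different factors.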

Code example

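from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# Placeholder inputs: replace with your own video path and English description.
VID_PATH = 'path/to/input.mp4'  # hypothetical path, not shipped with the model
TEXT = 'A panda eating bamboo on a rock.'  # sample prompt from the widget example

# Build the video-to-video pipeline from the published model.
pipe = pipeline(task='video-to-video', model='damo/Video-to-Video')
p_input = {
    'video_path': VID_PATH,
    'text': TEXT,
}

# Run inference; the returned dict holds the path of the generated video.
output_video_path = pipe(p_input, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]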
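Because inference is slow, it can help to validate the prompt before calling the pipeline. The following is a minimal sketch, not part of modelscope: `build_input` is a hypothetical helper, the 75-word limit is taken from the `max_words` validator in the metadata, and the ASCII test is only a crude stand-in for "English-only" input.

```python
MAX_PROMPT_WORDS = 75  # from the widget validator (max_words: 75) in the metadata


def build_input(video_path: str, text: str) -> dict:
    """Return the input dict used in the code example, after basic prompt checks.

    Hypothetical helper for illustration; modelscope does not provide it.
    """
    words = text.split()
    if not words:
        raise ValueError("prompt must not be empty")
    if len(words) > MAX_PROMPT_WORDS:
        raise ValueError(
            f"prompt has {len(words)} words; the limit is {MAX_PROMPT_WORDS}"
        )
    # Crude heuristic (assumption): treat non-ASCII text as non-English input.
    if not text.isascii():
        raise ValueError("prompt should be written in English")
    return {"video_path": video_path, "text": text}


if __name__ == "__main__":
    # 'path/to/input.mp4' is a placeholder path.
    print(build_input("path/to/input.mp4", "A panda eating bamboo on a rock."))
```

The returned dict has the same keys as `p_input` in the example above, so it can be passed to the pipeline call unchanged.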

Limitations

MS-Vid2Vid-XL may have the following limitations:

Distant targets may appear somewhat blurry. This can be resolved or mitigated by providing an input text description.
Computation is expensive: because the model generates 720P video, the latent space size is (160 * 90), and processing a single video takes more than 2 minutes.
Only English input is currently supported, owing to the training data.
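The (160 * 90) latent size quoted above is consistent with a 1280 * 720 frame divided by 8 on each axis. The sketch below assumes the 8x spatial downsampling commonly used by Stable Diffusion's autoencoder; this page does not state the factor, and `latent_size` is a hypothetical helper.

```python
VAE_DOWNSAMPLE = 8  # assumption: usual Stable Diffusion autoencoder factor


def latent_size(width: int, height: int, factor: int = VAE_DOWNSAMPLE) -> tuple:
    """Spatial size of the latent grid for a frame of the given pixel size."""
    if width % factor or height % factor:
        raise ValueError("frame size must be divisible by the downsampling factor")
    return width // factor, height // factor


def latent_positions(width: int, height: int) -> int:
    """Number of spatial positions per frame in latent space."""
    w, h = latent_size(width, height)
    return w * h


if __name__ == "__main__":
    print(latent_size(1280, 720))  # (160, 90)
    print(latent_size(448, 256))   # (56, 32)
    # Ratio of latent positions per frame between output and input sizes.
    print(latent_positions(1280, 720) / latent_positions(448, 256))
```

Under this assumption, each 720P frame has roughly eight times as many latent positions as a 448 * 256 frame, which is one reason the per-video runtime is high.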

License Agreement

Our code and model weights may be used only for personal/academic research; commercial use is not currently supported.
