Commit 526dc7a (verified) by ajayj · 1 Parent(s): e1dc92d

Update README.md

Files changed (1)
  1. README.md +7 -4
README.md CHANGED
@@ -108,14 +108,14 @@ To learn more check out the [Diffusers](https://huggingface.co/docs/diffusers/ma

  Mochi 1 represents a significant advancement in open-source video generation, featuring a 10 billion parameter diffusion model built on our novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained entirely from scratch, it is the largest video generative model ever openly released. And best of all, it’s a simple, hackable architecture.

- Alongside Mochi, we are open-sourcing our video VAE. Our VAE causally compresses videos to a 128x smaller size, with an 8x8 spatial and a 6x temporal compression to a 12-channel latent space.
+ Alongside Mochi, we are open-sourcing our video VAE. Our VAE causally compresses videos to a 96x smaller size, with an 8x8 spatial and a 6x temporal compression to a 12-channel latent space.

  An AsymmDiT efficiently processes user prompts alongside compressed video tokens by streamlining text processing and focusing neural network capacity on visual reasoning. AsymmDiT jointly attends to text and visual tokens with multi-modal self-attention and learns separate MLP layers for each modality, similar to Stable Diffusion 3. However, our visual stream has nearly 4 times as many parameters as the text stream via a larger hidden dimension. To unify the modalities in self-attention, we use non-square QKV and output projection layers. This asymmetric design reduces inference memory requirements.
  Many modern diffusion models use multiple pretrained language models to represent user prompts. In contrast, Mochi 1 simply encodes prompts with a single T5-XXL language model.

  ## Hardware Requirements

- The model requires at least 4 H100 GPUs to run. We welcome contributions from the community to reduce this requirement.
+ Mochi 1 supports a variety of hardware platforms depending on quantization level, ranging from a single 3090 GPU up to multiple H100 GPUs.

  ## Safety
  Genmo video models are general text-to-video diffusion models that inherently reflect the biases and preconceptions found in their training data. While steps have been taken to limit NSFW content, organizations should implement additional safety protocols and careful consideration before deploying these model weights in any commercial services or products.
@@ -127,8 +127,11 @@ Under the research preview, Mochi 1 is a living and evolving checkpoint. There a
  ## BibTeX
  ```
  @misc{genmo2024mochi,
- title={Mochi},
+ title={Mochi 1},
  author={Genmo Team},
- year={2024}
+ year={2024},
+ publisher = {GitHub},
+ journal = {GitHub repository},
+ howpublished={\url{https://github.com/genmoai/models}}
  }
  ```
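
The change from 128x to 96x in the VAE sentence can be checked against the compression factors stated in the same line. The sketch below assumes the input is 3-channel RGB video and that "smaller" means fewer scalar values per clip; neither assumption is stated in the README, and the function name is hypothetical.

```python
from fractions import Fraction


def vae_compression_ratio(spatial=(8, 8), temporal=6,
                          in_channels=3, latent_channels=12):
    """Ratio of input scalar values to latent scalar values.

    in_channels=3 (RGB) is an assumption; the README only states the
    8x8 spatial factor, the 6x temporal factor, and the 12 latent channels.
    """
    # Each latent position covers an 8 x 8 x 6 block of input pixels.
    positions_per_latent = spatial[0] * spatial[1] * temporal
    # Every input pixel carries in_channels values; every latent position
    # carries latent_channels values.
    return Fraction(positions_per_latent * in_channels, latent_channels)


ratio = vae_compression_ratio()
print(ratio)  # 96
```

Under these assumptions, 8 * 8 * 6 = 384 input positions map to one latent position, and the channel count grows from 3 to 12, so the overall reduction is 384 * 3 / 12 = 96, which matches the corrected figure.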