nvidia
/

Cosmos-0.1-Tokenizer-DV8x8x8

Model card Files Files and versions Community

Haoxiang-Wang commited on Nov 6, 2024

Commit

02a1654

·

verified ·

1 Parent(s): a1b2d85

Update README.md

Files changed (1) hide show

README.md +1 -1

README.md CHANGED Viewed

@@ -63,7 +63,7 @@ Under the NVIDIA Open Model License, NVIDIA confirms:
 ## Model Architecture:
-We designed Cosmos Tokenizer using a lightweight and computationally efficient architecture, featuring a temporally causal design. Specifically, we employ causal temporal convolution and causal temporal attention layers to preserve the natural temporal order of video frames, ensuring seamless tokenization of images and videos using a single unified network architecture. The encoder and decoder form a symmetrical pair, which are mirrors of each other. The encoder starts with a 2-level [Haar wavelet](https://link.springer.com/book/10.1007/978-3-319-04295-4) transform layer, which down-samples inputs by a factor of 4 in both spatial and temporal dimensions. Likewise, the decoder ends with an inverse wavelet transform. We employ the vanilla autoencoder (AE) formulation to model the latent space for continuous tokenizers. For discrete tokenizers, we adopt the [Finite-Scalar-Quantization](https://arxiv.org/abs/2309.15505) (FSQ) as the latent space quantizer.
 ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/638fb8cf2380ffd99caf8c2a/gQH5n9iCEtqZc7uutUwdL.jpeg)

 ## Model Architecture:
+We designed Cosmos Tokenizer using a lightweight and computationally efficient architecture, featuring a temporally causal design. Specifically, we employ causal temporal convolution and causal temporal attention layers to preserve the natural temporal order of video frames, ensuring seamless tokenization of images and videos using a single unified network architecture. The encoder and decoder form a symmetrical pair, which are mirrors of each other. The encoder starts with a 2-level [Haar wavelet](https://link.springer.com/book/10.1007/978-3-319-04295-4) transform layer, which down-samples inputs by a factor of 4 in both spatial and temporal dimensions. Likewise, the decoder ends with an inverse wavelet transform. We employ the vanilla autoencoder (AE) formulation to model the latent space for continuous tokenizers. For discrete tokenizers, we adopt the [Finite-Scalar-Quantization](https://openreview.net/forum?id=8ishA3LxN8) (FSQ) as the latent space quantizer.
 ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/638fb8cf2380ffd99caf8c2a/gQH5n9iCEtqZc7uutUwdL.jpeg)