Haoxiang-Wang
commited on
Update README.md
Browse files
README.md
CHANGED
@@ -63,7 +63,7 @@ Under the NVIDIA Open Model License, NVIDIA confirms:
|
|
63 |
|
64 |
## Model Architecture:
|
65 |
|
66 |
-
We designed Cosmos Tokenizer using a lightweight and computationally efficient architecture, featuring a temporally causal design. Specifically, we employ causal temporal convolution and causal temporal attention layers to preserve the natural temporal order of video frames, ensuring seamless tokenization of images and videos using a single unified network architecture. The encoder and decoder form a symmetrical pair, which are mirrors of each other. The encoder starts with a 2-level [Haar wavelet](https://link.springer.com/book/10.1007/978-3-319-04295-4) transform layer, which down-samples inputs by a factor of 4 in both spatial and temporal dimensions. Likewise, the decoder ends with an inverse wavelet transform. We employ the vanilla autoencoder (AE) formulation to model the latent space for continuous tokenizers. For discrete tokenizers, we adopt the [Finite-Scalar-Quantization](https://
|
67 |
|
68 |
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/638fb8cf2380ffd99caf8c2a/gQH5n9iCEtqZc7uutUwdL.jpeg)
|
69 |
|
|
|
63 |
|
64 |
## Model Architecture:
|
65 |
|
66 |
+
We designed Cosmos Tokenizer using a lightweight and computationally efficient architecture, featuring a temporally causal design. Specifically, we employ causal temporal convolution and causal temporal attention layers to preserve the natural temporal order of video frames, ensuring seamless tokenization of images and videos using a single unified network architecture. The encoder and decoder form a symmetrical pair, which are mirrors of each other. The encoder starts with a 2-level [Haar wavelet](https://link.springer.com/book/10.1007/978-3-319-04295-4) transform layer, which down-samples inputs by a factor of 4 in both spatial and temporal dimensions. Likewise, the decoder ends with an inverse wavelet transform. We employ the vanilla autoencoder (AE) formulation to model the latent space for continuous tokenizers. For discrete tokenizers, we adopt the [Finite-Scalar-Quantization](https://openreview.net/forum?id=8ishA3LxN8) (FSQ) as the latent space quantizer.
|
67 |
|
68 |
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/638fb8cf2380ffd99caf8c2a/gQH5n9iCEtqZc7uutUwdL.jpeg)
|
69 |
|