Update README.md
### Training Data

- **Preliminary:** The model was first trained on the [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling); the decoder layers were then frozen and the model was fine-tuned on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v). The former includes over 400,000 frames of footage from more than nine video games (including Watch Dogs 2, Grand Theft Auto V, and Cyberpunk), several Hollywood films, and high-definition photos. The latter comprises ~25,000 pairs of high-definition semantic segmentation maps and rendered frames, captured in-game from Grand Theft Auto V using a UNet-based semantic segmentation model.
- **Latest:** The latest model was trained purely on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v), a composition of over 1.24 billion real-world images and over 117 million in-game captured frames.

### Training Procedure

Images and their corresponding style semantic maps were resized to **512 x 512**.
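The resize step can be sketched as follows. This is a minimal NumPy nearest-neighbor resize, a stand-in for whatever transform the actual pipeline uses (torchvision's `Resize` would be the usual choice); all names here are illustrative. Nearest-neighbor sampling matters for the semantic maps, where interpolation must not blend class ids.

```python
import numpy as np

def resize_nearest(img: np.ndarray, size: int = 512) -> np.ndarray:
    """Resize an (H, W) or (H, W, C) array to (size, size) by nearest neighbor.

    Nearest-neighbor sampling is label-preserving, which matters for the
    semantic segmentation maps: bilinear filtering would blend class ids
    into values that belong to no class.
    """
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

# Hypothetical frame / semantic-map pair at a native 1080p render resolution.
frame = np.random.rand(1080, 1920, 3).astype(np.float32)
seg_map = np.random.randint(0, 20, size=(1080, 1920), dtype=np.int64)

frame_512 = resize_nearest(frame)    # (512, 512, 3)
seg_512 = resize_nearest(seg_map)    # (512, 512), original class ids only
```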
#### Training Hyperparameters

**v1**

- Precision: FP32
- Embedded dimensions: 768
- Hidden dimensions: 3072
- Attention Type: Linear Attention
- Style Transfer Module: AdaIN (Adaptive Instance Normalization)
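AdaIN has a compact closed form: re-normalize the content features to the per-channel mean and standard deviation of the style features. A NumPy sketch of that standard formulation (a simplified illustration, not the model's actual module):

```python
import numpy as np

def adain(content: np.ndarray, style: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Adaptive Instance Normalization over (C, H, W) feature maps.

    AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y), with the
    statistics computed per channel over the spatial dimensions, so the
    content keeps its structure but adopts the style's feature statistics.
    """
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_sd = content.std(axis=(1, 2), keepdims=True)
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_sd = style.std(axis=(1, 2), keepdims=True)
    return s_sd * (content - c_mu) / (c_sd + eps) + s_mu

rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(4, 8, 8))   # hypothetical content features
style = rng.normal(3.0, 2.0, size=(4, 8, 8))     # hypothetical style features
stylized = adain(content, style)
```

Because AdaIN has no learned parameters of its own, the style statistics fully determine the output's per-channel mean and variance.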
**v2**

- Precision: FP32
- Embedded dimensions: 768
- Hidden dimensions: 3072
- Attention Type: Location-Based Multi-Head Attention (Linear Attention)
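Both v1 and v2 list linear attention. A common way to realize it is the kernelized form with the feature map phi(x) = elu(x) + 1, which reorders the computation so the n x n attention matrix is never built; the sketch below assumes that standard recipe and is not taken from the model's code:

```python
import numpy as np

def elu_feature_map(x: np.ndarray) -> np.ndarray:
    # phi(x) = elu(x) + 1 keeps the features positive, so the softmax-free
    # attention weights below are positive and normalizable.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     eps: float = 1e-6) -> np.ndarray:
    """O(n) attention: softmax(QK^T)V is replaced by phi(Q) (phi(K)^T V).

    Associating (phi(K)^T V) first gives a (d, d_v) summary of the keys and
    values, so cost grows linearly with sequence length n.
    """
    q, k = elu_feature_map(q), elu_feature_map(k)
    kv = k.T @ v                                 # (d, d_v) key/value summary
    z = q @ k.sum(axis=0, keepdims=True).T       # (n, 1) normalizer
    return (q @ kv) / (z + eps)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 8)) for _ in range(3))
out = linear_attention(q, k, v)   # each row is a weighted average of v's rows
```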
- Precision: FP32, FP16, BF16, INT8
- Embedding Dimensions: 768
- Hidden Dimensions: 3072
- Attention Type: Location-Based Multi-Head Attention (Linear Attention) and Cross-Attention (Pretrained Attention-Guided)
- Number of Attention Heads: 32
- Number of Attention Layers: 16
- Number of Transformer Encoder Layers (Feed-Forward): 16
- Swin Window Size: 7
- Swin Shift Size: 2
- Style Transfer Module: Style Adaptive Layer Normalization (SALN)
- Style Encoder: Custom MultiScale Style Encoder
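SALN can be read as a LayerNorm whose gain and bias are predicted from a style vector rather than learned as fixed affine parameters. A minimal NumPy sketch under that interpretation; the projection matrices `w_gain` and `w_bias` are random stand-ins for learned weights, not the model's actual encoder:

```python
import numpy as np

def saln(x: np.ndarray, style: np.ndarray, w_gain: np.ndarray,
         w_bias: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Style-Adaptive Layer Normalization.

    Normalize each token over the feature dimension, then scale and shift
    with gain(style) and bias(style) in place of LayerNorm's fixed learned
    affine parameters, so one network adapts its activations per style.
    """
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    x_norm = (x - mu) / (sd + eps)
    gain = style @ w_gain        # (d_model,) gain predicted from the style
    bias = style @ w_bias        # (d_model,) bias predicted from the style
    return gain * x_norm + bias

rng = np.random.default_rng(0)
d_style, d_model = 16, 32
x = rng.normal(size=(10, d_model))            # hypothetical token sequence
style = rng.normal(size=(d_style,))           # one style embedding
w_gain = rng.normal(size=(d_style, d_model))  # stand-ins for learned weights
w_bias = rng.normal(size=(d_style, d_model))
out = saln(x, style, w_gain, w_bias)
```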
#### Speeds, Sizes, Times

**Model size:** There are currently four definitive versions of the model:

- v1_1: 224M params
- v1_2: 200M params
- v1_3: 93M params