aoxo committed
Commit 7d76a88 (1 parent: 6fb541c)

Update README.md

Files changed (1):
  1. README.md (+8 -5)
README.md CHANGED
@@ -100,7 +100,9 @@ visualize_tensor(output, "Output Image")
 
  ### Training Data
 
- The model was trained on [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling) and then the decoder layers were frozen to finetune it on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v). The former includes over 400,000 frames of footage from video games such as WatchDogs 2, Grand Theft Auto V, CyberPunk, several Hollywood films and high-defintion photos. The latter comprises of ~25,000 high-definition semantic segmentation map - rendered frame pairs captured from Grand Theft Auto V in-game and a UNet based Semantic Segmentation Model.
+ - **Preliminary:** The model was first trained on the [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling); the decoder layers were then frozen to fine-tune it on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v). The former includes over 400,000 frames of footage from more than nine video games (including Watch Dogs 2, Grand Theft Auto V and Cyberpunk), several Hollywood films and high-definition photos. The latter comprises ~25,000 high-definition pairs of semantic segmentation maps and rendered frames captured in-game from Grand Theft Auto V, with the maps produced by a UNet-based semantic segmentation model.
+
+ - **Latest:** The latest model was trained purely on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v), a composition of over 1.24 billion real-world images and over 117 million in-game captured frames.
 
  ### Training Procedure
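The preliminary recipe above pre-trains the full model, freezes the decoder layers, and then fine-tunes on the calibration set. Below is a minimal sketch of that freezing step in PyTorch, assuming the model exposes its decoder as a `decoder` submodule (the attribute name and learning rate are hypothetical, not taken from this repository):

```python
import torch

def freeze_decoder_for_finetuning(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Freeze decoder weights; fine-tune only the remaining parameters."""
    # Hypothetical layout: the decoder lives at `model.decoder`.
    for param in model.decoder.parameters():
        param.requires_grad = False

    # Build the optimizer only over parameters that are still trainable.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-5)  # learning rate chosen purely for illustration
```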
 
@@ -124,7 +126,7 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
  #### Training Hyperparameters
 
  **v1**
- - Precision: fp32
+ - Precision: FP32
  - Embedded dimensions: 768
  - Hidden dimensions: 3072
  - Attention Type: Linear Attention
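Both v1 and v2 list Linear Attention. The sketch below shows one standard kernel-feature formulation (the elu(x) + 1 feature map of Katharopoulos et al.), which may differ in detail from the variant actually used in this model:

```python
import torch
import torch.nn.functional as F

def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Kernelized (non-causal) linear attention.

    q, k: (batch, heads, seq, dim); v: (batch, heads, seq, dim_v).
    Cost grows linearly with sequence length instead of quadratically.
    """
    q = F.elu(q) + 1  # positive feature map phi(q)
    k = F.elu(k) + 1  # positive feature map phi(k)

    kv = torch.einsum("bhnd,bhne->bhde", k, v)                      # summarize keys/values once
    norm = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, norm)
```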
@@ -139,7 +141,7 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
  - Style Transfer Module: AdaIN (Adaptive Instance Normalization)
 
  **v2**
- - Precision: fp32
+ - Precision: FP32
  - Embedded dimensions: 768
  - Hidden dimensions: 3072
  - Attention Type: Location-Based Multi-Head Attention (Linear Attention)
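v1 lists AdaIN (Adaptive Instance Normalization) as its style transfer module: content features are re-normalized to the channel-wise mean and standard deviation of the style features. A minimal sketch of the standard formulation (Huang and Belongie), not necessarily this model's exact implementation:

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive Instance Normalization over (N, C, H, W) feature maps."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps

    # Strip the content statistics, then re-apply the style statistics.
    return s_std * (content - c_mean) / c_std + s_mean
```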
@@ -172,7 +174,7 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
  - Precision: FP32, FP16, BF16, INT8
  - Embedding Dimensions: 768
  - Hidden Dimensions: 3072
- - Attention Type: Location-Based Multi-Head Attention (Linear Attention) and Cross-Attention (pretrained attention-conditioned)
+ - Attention Type: Location-Based Multi-Head Attention (Linear Attention) and Cross-Attention (Pretrained Attention-Guided)
  - Number of Attention Heads: 32
  - Number of Attention Layers: 16
  - Number of Transformer Encoder Layers (Feed-Forward): 16
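This hunk pairs the linear attention with cross-attention. As a rough illustration of the listed width (768 embedding dimensions, 32 heads), here is a generic cross-attention call using stock PyTorch rather than the repository's own modules; the token shapes are hypothetical:

```python
import torch
from torch import nn

# Generic cross-attention with the dimensions listed above: 768-dim embeddings, 32 heads.
cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=32, batch_first=True)

frame_tokens = torch.randn(1, 1024, 768)  # queries: tokens from the rendered frame (hypothetical shape)
cond_tokens = torch.randn(1, 256, 768)    # keys/values: conditioning tokens (hypothetical shape)

out, weights = cross_attn(query=frame_tokens, key=cond_tokens, value=cond_tokens)
print(out.shape)  # torch.Size([1, 1024, 768])
```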
@@ -182,10 +184,11 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
  - Swin Window Size: 7
  - Swin Shift Size: 2
  - Style Transfer Module: Style Adaptive Layer Normalization (SALN)
+ - Style Encoder: Custom MultiScale Style Encoder
 
  #### Speeds, Sizes, Times
 
- **Model size:** There are currently five versions of the model:
+ **Model size:** There are currently four definitive versions of the model:
  - v1_1: 224M params
  - v1_2: 200M params
  - v1_3: 93M params
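The latest version replaces AdaIN with Style Adaptive Layer Normalization (SALN), where a style vector predicts the LayerNorm gain and bias, and pairs it with a custom multi-scale style encoder. The sketch below follows the common SALN formulation and is not the repository's own implementation; the dimensions reuse the 768-dim embedding listed above:

```python
import torch
from torch import nn

class SALN(nn.Module):
    """Style Adaptive Layer Normalization: a style vector predicts scale and shift."""

    def __init__(self, dim: int = 768, style_dim: int = 768):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.affine = nn.Linear(style_dim, 2 * dim)  # per-channel gamma and beta from the style code

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return (1 + gamma.unsqueeze(1)) * self.norm(x) + beta.unsqueeze(1)
```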
 