Update README.md
### Training Data

- **Preliminary:** The model was first trained on the [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling); the decoder layers were then frozen and the model was fine-tuned on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v). The former includes over 400,000 frames of footage from more than nine video games (including Watch Dogs 2, Grand Theft Auto V, and Cyberpunk), several Hollywood films, and high-definition photos. The latter comprises ~25,000 pairs of high-definition semantic segmentation maps and rendered frames, captured in-game from Grand Theft Auto V using a UNet-based semantic segmentation model.
- **Latest:** The latest model was trained purely on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v), a composition of over 1.24 billion real-world images and over 117 million in-game captured frames.

### Training Procedure

Images and their corresponding style semantic maps were resized to **512 x 512**.
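The resize step can be sketched as follows. This is a minimal NumPy nearest-neighbor resize, a stand-in for whatever transform the actual pipeline uses (torchvision's `Resize` would be the usual choice); all names here are illustrative. Nearest-neighbor sampling matters for the semantic maps, where interpolation must not blend class ids.

```python
import numpy as np

def resize_nearest(img: np.ndarray, size: int = 512) -> np.ndarray:
    """Resize an (H, W) or (H, W, C) array to (size, size) by nearest neighbor.

    Nearest-neighbor sampling is label-preserving, which matters for the
    semantic segmentation maps: bilinear filtering would blend class ids
    into values that belong to no class.
    """
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

# Hypothetical frame / semantic-map pair at a native 1080p render resolution.
frame = np.random.rand(1080, 1920, 3).astype(np.float32)
seg_map = np.random.randint(0, 20, size=(1080, 1920), dtype=np.int64)

frame_512 = resize_nearest(frame)    # (512, 512, 3)
seg_512 = resize_nearest(seg_map)    # (512, 512), original class ids only
```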
#### Training Hyperparameters

**v1**

- Precision: FP32
- Embedded dimensions: 768
- Hidden dimensions: 3072
- Attention Type: Linear Attention
- Style Transfer Module: AdaIN (Adaptive Instance Normalization)
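AdaIN has a compact closed form: re-normalize the content features to the per-channel mean and standard deviation of the style features. A NumPy sketch of that standard formulation (a simplified illustration, not the model's actual module):

```python
import numpy as np

def adain(content: np.ndarray, style: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Adaptive Instance Normalization over (C, H, W) feature maps.

    AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y), with the
    statistics computed per channel over the spatial dimensions, so the
    content keeps its structure but adopts the style's feature statistics.
    """
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_sd = content.std(axis=(1, 2), keepdims=True)
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_sd = style.std(axis=(1, 2), keepdims=True)
    return s_sd * (content - c_mu) / (c_sd + eps) + s_mu

rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(4, 8, 8))   # hypothetical content features
style = rng.normal(3.0, 2.0, size=(4, 8, 8))     # hypothetical style features
stylized = adain(content, style)
```

Because AdaIN has no learned parameters of its own, the style statistics fully determine the output's per-channel mean and variance.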
**v2**

- Precision: FP32
- Embedded dimensions: 768
- Hidden dimensions: 3072
- Attention Type: Location-Based Multi-Head Attention (Linear Attention)
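Both v1 and v2 list linear attention. A common way to realize it is the kernelized form with the feature map phi(x) = elu(x) + 1, which reorders the computation so the n x n attention matrix is never built; the sketch below assumes that standard recipe and is not taken from the model's code:

```python
import numpy as np

def elu_feature_map(x: np.ndarray) -> np.ndarray:
    # phi(x) = elu(x) + 1 keeps the features positive, so the softmax-free
    # attention weights below are positive and normalizable.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     eps: float = 1e-6) -> np.ndarray:
    """O(n) attention: softmax(QK^T)V is replaced by phi(Q) (phi(K)^T V).

    Associating (phi(K)^T V) first gives a (d, d_v) summary of the keys and
    values, so cost grows linearly with sequence length n.
    """
    q, k = elu_feature_map(q), elu_feature_map(k)
    kv = k.T @ v                                 # (d, d_v) key/value summary
    z = q @ k.sum(axis=0, keepdims=True).T       # (n, 1) normalizer
    return (q @ kv) / (z + eps)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 8)) for _ in range(3))
out = linear_attention(q, k, v)   # each row is a weighted average of v's rows
```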
- Precision: FP32, FP16, BF16, INT8
- Embedding Dimensions: 768
- Hidden Dimensions: 3072
- Attention Type: Location-Based Multi-Head Attention (Linear Attention) and Cross-Attention (Pretrained Attention-Guided)
- Number of Attention Heads: 32
- Number of Attention Layers: 16
- Number of Transformer Encoder Layers (Feed-Forward): 16
- Swin Window Size: 7
- Swin Shift Size: 2
- Style Transfer Module: Style Adaptive Layer Normalization (SALN)
- Style Encoder: Custom MultiScale Style Encoder
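SALN can be read as a LayerNorm whose gain and bias are predicted from a style vector rather than learned as fixed affine parameters. A minimal NumPy sketch under that interpretation; the projection matrices `w_gain` and `w_bias` are random stand-ins for learned weights, not the model's actual encoder:

```python
import numpy as np

def saln(x: np.ndarray, style: np.ndarray, w_gain: np.ndarray,
         w_bias: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Style-Adaptive Layer Normalization.

    Normalize each token over the feature dimension, then scale and shift
    with gain(style) and bias(style) in place of LayerNorm's fixed learned
    affine parameters, so one network adapts its activations per style.
    """
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    x_norm = (x - mu) / (sd + eps)
    gain = style @ w_gain        # (d_model,) gain predicted from the style
    bias = style @ w_bias        # (d_model,) bias predicted from the style
    return gain * x_norm + bias

rng = np.random.default_rng(0)
d_style, d_model = 16, 32
x = rng.normal(size=(10, d_model))            # hypothetical token sequence
style = rng.normal(size=(d_style,))           # one style embedding
w_gain = rng.normal(size=(d_style, d_model))  # stand-ins for learned weights
w_bias = rng.normal(size=(d_style, d_model))
out = saln(x, style, w_gain, w_bias)
```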
#### Speeds, Sizes, Times

**Model size:** There are currently four definitive versions of the model:

- v1_1: 224M params
- v1_2: 200M params
- v1_3: 93M params