aoxo committed on
Commit
275979b
1 Parent(s): d89d341

Update README.md

Files changed (1)
  1. README.md +92 -1
README.md CHANGED
@@ -254,7 +254,98 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
  RealFormer is a Transformer-based, low-latency Style Transfer Generative LM that attempts to reconstruct each frame as a more photorealistic image.
  The objective of RealFormer is to attain a level of real-world detail that even current video games with exhaustive graphics cannot match.

- **Architecture:** The latest model, v2_1, introduces Location-based Multi-head Attention (LbMhA) to improve feature extraction with fewer parameters. Its three predecessors attained a similar level of accuracy without the LbMhA layers. The general architecture is as follows:
+ **Flagship Architecture v4:** The v4 model builds on the previous version by introducing **Attention Guided Attention (AGA)**, which leverages attention weights learned in a motion-guided cross-attention preprocessing stage. Conditioning these pre-learned weights into the otherwise untrained attention mechanism improves the model's ability to focus on dynamic regions across consecutive frames. v4 also retains **Style Adaptive Layer Normalization (SALN)** to enhance feature extraction. By transferring knowledge from motion-vector-based attention without retraining the learned weights, the architecture significantly improves temporal coherence and photorealistic enhancement, trains more efficiently, and captures real-world dynamics better.
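+
+ Read literally, AGA amounts to seeding the cross-attention layers of a fresh v4 model with attention parameters already learned by the motion-guided preprocessing stage and then keeping them frozen. The sketch below shows one way such a transfer could be written; `motion_stage` and the freezing policy are illustrative assumptions, not code from this repository.
+
+ ```python
+ import torch.nn as nn
+
+ def transfer_motion_attention(pretrained_attn: nn.MultiheadAttention,
+                               target_attn: nn.MultiheadAttention,
+                               freeze: bool = True) -> None:
+     """Hypothetical AGA weight transfer: attention parameters learned by a
+     motion-guided cross-attention stage are loaded into an untrained
+     attention layer and optionally frozen, so the new model starts out
+     attending to dynamic regions between consecutive frames."""
+     target_attn.load_state_dict(pretrained_attn.state_dict())
+     if freeze:
+         for p in target_attn.parameters():
+             p.requires_grad = False
+
+ # Example (module paths follow the printed structure below; `motion_stage`
+ # is a hypothetical pretrained preprocessing module):
+ # for block in model.encoder_layers:
+ #     transfer_motion_attention(motion_stage.attn, block.attn.attn)
+ ```
+
+ The full printed module structure of the v4 model follows: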
+
+ ```python
+ RealFormerAGA(
+   (patch_embed): DynamicPatchEmbedding(
+     (proj): Conv2d(2048, 768, kernel_size=(1, 1), stride=(1, 1))
+   )
+   (encoder_layers): ModuleList(
+     (0-15): 16 x TransformerEncoderBlock(
+       (attn): CrossAttentionLayer(
+         (attn): MultiheadAttention(
+           (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+         )
+         (dropout): Dropout(p=0.1, inplace=False)
+       )
+       (ff): Sequential(
+         (0): Linear(in_features=768, out_features=3072, bias=True)
+         (1): ReLU()
+         (2): Linear(in_features=3072, out_features=768, bias=True)
+       )
+       (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+       (norm2): StyleAdaptiveLayerNorm(
+         (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+         (fc): Linear(in_features=768, out_features=1536, bias=True)
+       )
+       (dropout): Dropout(p=0.1, inplace=False)
+     )
+   )
+   (decoder_layers): ModuleList(
+     (0-15): 16 x TransformerDecoderBlock(
+       (attn1): CrossAttentionLayer(
+         (attn): MultiheadAttention(
+           (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+         )
+         (dropout): Dropout(p=0.1, inplace=False)
+       )
+       (attn2): CrossAttentionLayer(
+         (attn): MultiheadAttention(
+           (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+         )
+         (dropout): Dropout(p=0.1, inplace=False)
+       )
+       (ff): Sequential(
+         (0): Linear(in_features=768, out_features=3072, bias=True)
+         (1): ReLU()
+         (2): Linear(in_features=3072, out_features=768, bias=True)
+       )
+       (norm1): StyleAdaptiveLayerNorm(
+         (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+         (fc): Linear(in_features=768, out_features=1536, bias=True)
+       )
+       (norm2): StyleAdaptiveLayerNorm(
+         (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+         (fc): Linear(in_features=768, out_features=1536, bias=True)
+       )
+       (norm3): StyleAdaptiveLayerNorm(
+         (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+         (fc): Linear(in_features=768, out_features=1536, bias=True)
+       )
+     )
+   )
+   (swin_layers): ModuleList(
+     (0-15): 16 x SwinTransformerBlock(
+       (attn): MultiheadAttention(
+         (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+       )
+       (mlp): Sequential(
+         (0): Linear(in_features=768, out_features=3072, bias=True)
+         (1): GELU(approximate='none')
+         (2): Linear(in_features=3072, out_features=768, bias=True)
+       )
+       (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+       (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+     )
+   )
+   (refinement): RefinementBlock(
+     (conv): Conv2d(768, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+     (bn): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
+     (relu): ReLU(inplace=True)
+   )
+   (final_layer): Conv2d(3, 2048, kernel_size=(1, 1), stride=(1, 1))
+   (style_encoder): Sequential(
+     (0): Conv2d(2048, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
+     (1): ReLU()
+     (2): AdaptiveAvgPool2d(output_size=1)
+     (3): Flatten(start_dim=1, end_dim=-1)
+     (4): Linear(in_features=768, out_features=768, bias=True)
+   )
+ )
+ ```
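+
+ In the structure above, each `StyleAdaptiveLayerNorm` pairs a plain LayerNorm with an `fc: Linear(768, 1536)` layer, which suggests that a 768-dimensional style embedding (produced by `style_encoder`) is mapped to a per-channel scale and shift (2 × 768 = 1536) applied after normalization. A minimal sketch of such a module is below; the `1 + gamma` residual convention and the tensor shapes are assumptions inferred from the printout, not code from this repository.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class StyleAdaptiveLayerNorm(nn.Module):
+     """Sketch of SALN: normalize, then rescale/shift with parameters
+     predicted from a style embedding (as suggested by Linear(768 -> 1536))."""
+
+     def __init__(self, dim: int = 768):
+         super().__init__()
+         self.norm = nn.LayerNorm(dim, eps=1e-5, elementwise_affine=True)
+         self.fc = nn.Linear(dim, 2 * dim)  # predicts concatenated (gamma, beta)
+
+     def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
+         # x: (batch, tokens, dim); style: (batch, dim) from the style encoder
+         gamma, beta = self.fc(style).chunk(2, dim=-1)
+         x = self.norm(x)
+         # broadcast the style-conditioned scale and shift over the token axis
+         return x * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
+ ```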
+
+ **v3 Architecture:** The v3 model introduces Style Adaptive Layer Normalization (SALN) and Location-based Multi-head Attention (LbMhA) to improve feature extraction with fewer parameters. Its two predecessors attained a similar level of accuracy without the LbMhA layers; with SALN, v3 outperforms them by up to ~13%. The general architecture is as follows:

  ```python
  RealFormerv3(