aoxo committed on
Commit
275979b
1 Parent(s): d89d341

Update README.md

Files changed (1)
  1. README.md +92 -1
README.md CHANGED
@@ -254,7 +254,98 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
  RealFormer is a Transformer-based, low-latency Style Transfer Generative LM that attempts to reconstruct each frame as a more photorealistic image.
  The objective of RealFormer is to attain a level of real-world detail that even current video games with exhaustive graphics cannot match.

- **Architecture:** The latest model, v2_1, introduces Location-based Multi-head Attention (LbMhA) to improve feature extraction with fewer parameters. Its three predecessors attained a similar level of accuracy without the LbMhA layers. The general architecture is as follows:
+ **Flagship Architecture v4:** The v4 model builds on the previous version by introducing **Attention Guided Attention (AGA)**, which leverages attention weights learned in a motion-guided cross-attention preprocessing stage. Conditioning these pre-learned weights into the otherwise untrained attention mechanism improves the model's ability to focus on dynamic regions across consecutive frames. v4 also retains **Style Adaptive Layer Normalization (SALN)** to enhance feature extraction. By transferring knowledge from motion-vector-based attention without retraining the learned weights, the architecture significantly improves temporal coherence and photorealistic enhancement, trains more efficiently, and captures real-world dynamics better.
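+
+ Read literally, AGA amounts to seeding the cross-attention layers of a fresh v4 model with attention parameters already learned by the motion-guided preprocessing stage and then keeping them frozen. The sketch below shows one way such a transfer could be written; `motion_stage` and the freezing policy are illustrative assumptions, not code from this repository.
+
+ ```python
+ import torch.nn as nn
+
+ def transfer_motion_attention(pretrained_attn: nn.MultiheadAttention,
+                               target_attn: nn.MultiheadAttention,
+                               freeze: bool = True) -> None:
+     """Hypothetical AGA weight transfer: attention parameters learned by a
+     motion-guided cross-attention stage are loaded into an untrained
+     attention layer and optionally frozen, so the new model starts out
+     attending to dynamic regions between consecutive frames."""
+     target_attn.load_state_dict(pretrained_attn.state_dict())
+     if freeze:
+         for p in target_attn.parameters():
+             p.requires_grad = False
+
+ # Example (module paths follow the printed structure below; `motion_stage`
+ # is a hypothetical pretrained preprocessing module):
+ # for block in model.encoder_layers:
+ #     transfer_motion_attention(motion_stage.attn, block.attn.attn)
+ ```
+
+ The full printed module structure of the v4 model follows: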
+
+ ```python
+ RealFormerAGA(
+   (patch_embed): DynamicPatchEmbedding(
+     (proj): Conv2d(2048, 768, kernel_size=(1, 1), stride=(1, 1))
+   )
+   (encoder_layers): ModuleList(
+     (0-15): 16 x TransformerEncoderBlock(
+       (attn): CrossAttentionLayer(
+         (attn): MultiheadAttention(
+           (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+         )
+         (dropout): Dropout(p=0.1, inplace=False)
+       )
+       (ff): Sequential(
+         (0): Linear(in_features=768, out_features=3072, bias=True)
+         (1): ReLU()
+         (2): Linear(in_features=3072, out_features=768, bias=True)
+       )
+       (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+       (norm2): StyleAdaptiveLayerNorm(
+         (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+         (fc): Linear(in_features=768, out_features=1536, bias=True)
+       )
+       (dropout): Dropout(p=0.1, inplace=False)
+     )
+   )
+   (decoder_layers): ModuleList(
+     (0-15): 16 x TransformerDecoderBlock(
+       (attn1): CrossAttentionLayer(
+         (attn): MultiheadAttention(
+           (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+         )
+         (dropout): Dropout(p=0.1, inplace=False)
+       )
+       (attn2): CrossAttentionLayer(
+         (attn): MultiheadAttention(
+           (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+         )
+         (dropout): Dropout(p=0.1, inplace=False)
+       )
+       (ff): Sequential(
+         (0): Linear(in_features=768, out_features=3072, bias=True)
+         (1): ReLU()
+         (2): Linear(in_features=3072, out_features=768, bias=True)
+       )
+       (norm1): StyleAdaptiveLayerNorm(
+         (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+         (fc): Linear(in_features=768, out_features=1536, bias=True)
+       )
+       (norm2): StyleAdaptiveLayerNorm(
+         (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+         (fc): Linear(in_features=768, out_features=1536, bias=True)
+       )
+       (norm3): StyleAdaptiveLayerNorm(
+         (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+         (fc): Linear(in_features=768, out_features=1536, bias=True)
+       )
+     )
+   )
+   (swin_layers): ModuleList(
+     (0-15): 16 x SwinTransformerBlock(
+       (attn): MultiheadAttention(
+         (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+       )
+       (mlp): Sequential(
+         (0): Linear(in_features=768, out_features=3072, bias=True)
+         (1): GELU(approximate='none')
+         (2): Linear(in_features=3072, out_features=768, bias=True)
+       )
+       (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+       (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+     )
+   )
+   (refinement): RefinementBlock(
+     (conv): Conv2d(768, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+     (bn): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
+     (relu): ReLU(inplace=True)
+   )
+   (final_layer): Conv2d(3, 2048, kernel_size=(1, 1), stride=(1, 1))
+   (style_encoder): Sequential(
+     (0): Conv2d(2048, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
+     (1): ReLU()
+     (2): AdaptiveAvgPool2d(output_size=1)
+     (3): Flatten(start_dim=1, end_dim=-1)
+     (4): Linear(in_features=768, out_features=768, bias=True)
+   )
+ )
+ ```
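+
+ In the structure above, each `StyleAdaptiveLayerNorm` pairs a plain LayerNorm with an `fc: Linear(768, 1536)` layer, which suggests that a 768-dimensional style embedding (produced by `style_encoder`) is mapped to a per-channel scale and shift (2 × 768 = 1536) applied after normalization. A minimal sketch of such a module is below; the `1 + gamma` residual convention and the tensor shapes are assumptions inferred from the printout, not code from this repository.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class StyleAdaptiveLayerNorm(nn.Module):
+     """Sketch of SALN: normalize, then rescale/shift with parameters
+     predicted from a style embedding (as suggested by Linear(768 -> 1536))."""
+
+     def __init__(self, dim: int = 768):
+         super().__init__()
+         self.norm = nn.LayerNorm(dim, eps=1e-5, elementwise_affine=True)
+         self.fc = nn.Linear(dim, 2 * dim)  # predicts concatenated (gamma, beta)
+
+     def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
+         # x: (batch, tokens, dim); style: (batch, dim) from the style encoder
+         gamma, beta = self.fc(style).chunk(2, dim=-1)
+         x = self.norm(x)
+         # broadcast the style-conditioned scale and shift over the token axis
+         return x * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
+ ```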
+
+ **v3 Architecture:** The v3 model introduces Style Adaptive Layer Normalization (SALN) and Location-based Multi-head Attention (LbMhA) to improve feature extraction with fewer parameters. Its two predecessors attained a similar level of accuracy without the LbMhA layers; with SALN, v3 outperforms them by up to ~13%. The general architecture is as follows:

  ```python
  RealFormerv3(