Update README.md
@@ -254,7 +254,98 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
RealFormer is a Transformer-based low-latency Style Transfer Generative LM that attempts to reconstruct each frame into a more photorealistic image.
The objective of RealFormer is to attain a level of real-world detail that even current video games with exhaustive graphics cannot achieve.

**Flagship Architecture v4:** The v4 model builds on the previous version by introducing **Attention Guided Attention (AGA)**, which reuses attention weights learned in a motion-guided cross-attention preprocessing stage. These pre-learned weights condition the untrained attention mechanism, improving the model's ability to focus on dynamic regions across consecutive frames. v4 also retains **Style Adaptive Layer Normalization (SALN)** to enhance feature extraction. Because knowledge is transferred from motion-vector-based attention without retraining the learned weights, the architecture improves temporal coherence and photorealistic enhancement while making training more efficient and capturing real-world dynamics more faithfully.
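
The paragraph above does not spell out how the pre-learned weights are "conditioned into" the new attention, so the following is only one possible reading, sketched under stated assumptions rather than taken from the RealFormer code. It assumes a frozen guide attention produces an attention map over motion features, and that the log of this map is added as a bias to the scores of a trainable attention layer. The class name `GuidedAttention`, its arguments, and the log-bias mechanism are all hypothetical; in practice the guide weights would be loaded from the preprocessing stage rather than randomly initialized as here.

```python
import torch
import torch.nn as nn


class GuidedAttention(nn.Module):
    """Hypothetical sketch: a frozen guide attention biases a trainable one."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        # Assumption: `guide` stands in for the pre-learned motion-guided attention.
        self.guide = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The guide is never retrained, so its parameters are frozen.
        for p in self.guide.parameters():
            p.requires_grad_(False)

    def forward(self, frame_tokens: torch.Tensor, motion_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, L, dim), motion_tokens: (B, S, dim)
        with torch.no_grad():
            # guide_map: (B, L, S), attention weights averaged over heads.
            _, guide_map = self.guide(
                frame_tokens, motion_tokens, motion_tokens, need_weights=True
            )
        # Turn the map into an additive bias on the attention scores.
        bias = torch.log(guide_map + 1e-6)
        # A 3D float attn_mask must have shape (B * num_heads, L, S).
        bias = bias.repeat_interleave(self.num_heads, dim=0)
        out, _ = self.attn(
            frame_tokens, motion_tokens, motion_tokens,
            attn_mask=bias, need_weights=False,
        )
        return out
```

With this arrangement only `attn` receives gradients, which is one way the guide's knowledge could be reused without retraining it.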

    (patch_embed): DynamicPatchEmbedding(
      (proj): Conv2d(2048, 768, kernel_size=(1, 1), stride=(1, 1))
    )
    (encoder_layers): ModuleList(
      (0-15): 16 x TransformerEncoderBlock(
        (attn): CrossAttentionLayer(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ff): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): ReLU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (norm2): StyleAdaptiveLayerNorm(
          (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc): Linear(in_features=768, out_features=1536, bias=True)
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (decoder_layers): ModuleList(
      (0-15): 16 x TransformerDecoderBlock(
        (attn1): CrossAttentionLayer(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (attn2): CrossAttentionLayer(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ff): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): ReLU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (norm1): StyleAdaptiveLayerNorm(
          (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc): Linear(in_features=768, out_features=1536, bias=True)
        )
        (norm2): StyleAdaptiveLayerNorm(
          (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc): Linear(in_features=768, out_features=1536, bias=True)
        )
        (norm3): StyleAdaptiveLayerNorm(
          (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc): Linear(in_features=768, out_features=1536, bias=True)
        )
      )
    )
    (swin_layers): ModuleList(
      (0-15): 16 x SwinTransformerBlock(
        (attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
        )
        (mlp): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
    )
    (refinement): RefinementBlock(
      (conv): Conv2d(768, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (final_layer): Conv2d(3, 2048, kernel_size=(1, 1), stride=(1, 1))
    (style_encoder): Sequential(
      (0): Conv2d(2048, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (1): ReLU()
      (2): AdaptiveAvgPool2d(output_size=1)
      (3): Flatten(start_dim=1, end_dim=-1)
      (4): Linear(in_features=768, out_features=768, bias=True)
    )

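The printout above lists the layers of `StyleAdaptiveLayerNorm` and `style_encoder` but not their forward passes. The sketch below rebuilds both from the printed shapes and shows one plausible way they could fit together. The layer sizes come from the printout. Everything else is an assumption: that the 1536 outputs of `fc` are split into a 768-dimensional scale and shift, that the scale is applied as `1 + gamma`, and that the style vector comes from `style_encoder`.

```python
import torch
import torch.nn as nn


class StyleAdaptiveLayerNorm(nn.Module):
    """Sketch of SALN: a LayerNorm whose output is modulated by a style vector."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # 2 * dim outputs, matching fc: Linear(768 -> 1536) in the printout.
        self.fc = nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens, style: (B, dim) one style vector per sample.
        # Assumption: the fc output splits into a scale and a shift.
        gamma, beta = self.fc(style).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)


def build_style_encoder() -> nn.Sequential:
    """Same layers and sizes as the printed `style_encoder`."""
    return nn.Sequential(
        nn.Conv2d(2048, 768, kernel_size=3, stride=2, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),  # (B, 768, H', W') -> (B, 768, 1, 1)
        nn.Flatten(),             # (B, 768, 1, 1) -> (B, 768)
        nn.Linear(768, 768),
    )
```

Under these assumptions, a `(B, 2048, H, W)` feature map is reduced to a `(B, 768)` style vector, which then modulates `(B, N, 768)` token features without changing their shape.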
**v3 Architecture:** The v3 model introduces Style Adaptive Layer Normalization (SALN) and Location-based Multi-head Attention (LbMhA) to improve feature extraction with fewer parameters. The two earlier versions attained a similar level of accuracy without the LbMhA layers; with SALN added, v3 outperforms them by up to ~13%. The general architecture is as follows: