---
license: mit
datasets:
- imagenet-1k
language:
- en
- zh
---
# Model Card for VAR (Visual AutoRegressive) Transformers 🔥

[![arXiv](https://img.shields.io/badge/arXiv%20paper-2404.02905-b31b1b.svg)](https://arxiv.org/abs/2404.02905) [![demo platform](https://img.shields.io/badge/Play%20with%20VAR%21-VAR%20demo%20platform-lightblue)](https://var.vision/demo)

VAR is a new visual generation framework that makes GPT-style models surpass diffusion models **for the first time** 🚀, and it exhibits clear power-law Scaling Laws 📈 like those observed in large language models (LLMs).

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/60e73ffd06ad9ae5bbcfc52c/FusWBHW8uJgYWO02HFNGz.png" width="93%">
</p>

VAR redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" (or "next-resolution prediction"), diverging from the standard raster-scan "next-token prediction".

<p align="center">
<img src="https://github.com/FoundationVision/VAR/assets/39692511/3e12655c-37dc-4528-b923-ec6c4cfef178" width="93%">
</p>
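
To make this coarse-to-fine procedure concrete, here is a minimal, illustrative sketch of the next-scale prediction loop. The scale schedule and codebook size below follow the 256x256 configuration described in the GitHub README but should be treated as assumptions, and `predict_next_scale` is a hypothetical stand-in for the VAR transformer, which predicts all tokens of a scale in parallel, conditioned on every coarser scale:

```python
import torch

# Token-map resolutions from coarse to fine; this schedule
# (1, 2, 3, 4, 5, 6, 8, 10, 13, 16) is an assumption taken from the
# 256x256 setup in the GitHub README.
SCALES = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]
VOCAB_SIZE = 4096  # multi-scale VQVAE codebook size (also an assumption)


def predict_next_scale(context, hw):
    """Hypothetical stand-in for the VAR transformer.

    Unlike raster-scan next-token prediction, which emits one token at a
    time, next-scale prediction emits all hw*hw tokens of the next
    resolution in one step, conditioned on every coarser token map.
    Random ids are sampled here so the sketch runs end to end.
    """
    del context  # the real model attends over all previously generated scales
    return torch.randint(0, VOCAB_SIZE, (hw, hw))


token_maps = []  # generated token maps, coarse to fine
for hw in SCALES:
    token_maps.append(predict_next_scale(token_maps, hw))

# A multi-scale VQVAE would then decode this pyramid of token maps into an
# image; here we just print each map's shape.
print([tuple(t.shape) for t in token_maps])
```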
This repository hosts VAR's model checkpoints.
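
The weights can be fetched with `huggingface_hub`, as in the sketch below. The repo id and checkpoint filenames (`vae_ch160v4096z32.pth`, `var_d16.pth`) are assumptions based on the download links in the GitHub README, so verify them against this repo's file list:

```python
from huggingface_hub import hf_hub_download

# Repo id and filenames are assumptions taken from the GitHub README's
# download links; check the "Files and versions" tab to confirm them.
vae_path = hf_hub_download(repo_id="FoundationVision/var", filename="vae_ch160v4096z32.pth")
var_path = hf_hub_download(repo_id="FoundationVision/var", filename="var_d16.pth")
print(vae_path, var_path)
```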
For more details and tutorials, see https://github.com/FoundationVision/VAR.