Papers
arxiv:2502.20388

Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Published on Feb 27
· Submitted by OliverRen on Feb 28
Abstract

Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a k×k grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground-truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20× faster inference. Meanwhile, xAR-H sets a new state of the art with an FID of 1.24, running 2.2× faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.

Community


xAR generalizes next-token prediction to next-X prediction.
xAR finds that next-cell prediction yields the best performance by capturing richer spatial-semantic relationships.
xAR proposes Noisy Context Learning (NCL): the model is deliberately exposed to noisy contexts during training, addressing the error accumulation that teacher forcing otherwise causes at inference.
xAR-B (172M) outperforms the large DiT-XL and SiT-XL (675M) while achieving 20× faster inference. Additionally, the largest model, xAR-H (1.1B), sets a new state-of-the-art with an FID of 1.24 on ImageNet-256, without relying on vision foundation models (e.g., DINOv2) or extra guidance interval sampling.
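The ideas above can be sketched in a few lines: group patches into k×k cells (the "next-X" prediction unit), regress each cell with a flow-matching objective instead of a discrete classification, and condition on noised previous entities rather than clean ground truth (NCL). This is a minimal numpy illustration under assumed shapes and a toy stand-in for the velocity network; the function names (`make_cells`, `flow_matching_loss`, `noisy_context`) are hypothetical and not the authors' API.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_cells(patches, k=2):
    """Group an (H, W, D) grid of patch latents into k×k cells in
    raster order -- the 'next-cell' prediction unit of xAR."""
    H, W, D = patches.shape
    cells = patches.reshape(H // k, k, W // k, k, D)
    return cells.transpose(0, 2, 1, 3, 4).reshape(-1, k * k * D)

def flow_matching_loss(predict_velocity, context, target, rng):
    """One AR step: regress the target entity with a rectified-flow
    style objective instead of classifying a discrete token."""
    t = rng.uniform()                        # interpolation time in [0, 1]
    noise = rng.standard_normal(target.shape)
    x_t = (1.0 - t) * noise + t * target     # linear noise-to-data path
    v_target = target - noise                # constant velocity along the path
    v_pred = predict_velocity(context, x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))

def noisy_context(cells, step, t_ctx, rng):
    """Noisy Context Learning: condition on *noised* preceding entities
    rather than clean ground truth, mitigating exposure bias."""
    ctx = cells[:step]
    noise = rng.standard_normal(ctx.shape)
    return (1.0 - t_ctx) * noise + t_ctx * ctx

# Toy stand-in for the velocity network (a real model would attend to ctx).
predict_velocity = lambda ctx, x_t, t: x_t

patches = rng.standard_normal((4, 4, 8))     # 4×4 grid of 8-dim patch latents
cells = make_cells(patches, k=2)             # 4 cells, each of dimension 32
ctx = noisy_context(cells, step=2, t_ctx=0.7, rng=rng)
loss = flow_matching_loss(predict_velocity, ctx, cells[2], rng)
```

At inference, the same network would denoise each entity from pure noise conditioned on previously generated (hence imperfect) entities, which is exactly the regime NCL trains for.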

