arxiv:2410.01600

ENTP: Encoder-only Next Token Prediction

Published on Oct 2

Authors:

Ethan Ewer ,

Abstract

Next-token prediction models have predominantly relied on decoder-only Transformers with causal attention, driven by the common belief that causal attention is essential to prevent "cheating" by masking future tokens. We challenge this widely accepted notion and argue that this design choice is about efficiency rather than necessity. While decoder-only Transformers are still a good choice for practical reasons, they are not the only viable option. In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP. We introduce the Triplet-Counting task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate ENTP's superior performance across various realistic tasks, such as length generalization and in-context learning.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2410.01600 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2410.01600 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2410.01600 in a Space README.md to link it from this page.