Transformer Language Models without Positional Encodings Still Learn Positional Information
Abstract
Causal transformer language models (LMs), such as GPT-3, typically require some form of positional encoding, such as positional embeddings. However, we show that LMs without any explicit positional encoding are still competitive with standard models, and that this phenomenon is robust across different datasets, model sizes, and sequence lengths. Probing experiments reveal that such models acquire an implicit notion of absolute positions throughout the network, effectively compensating for the missing information. We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position. Our findings indicate that causal LMs might derive positional awareness not only from the explicit positioning mechanism, but also from the effects of the causal mask.
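Below is a minimal NumPy sketch (not from the paper) illustrating the intuition behind the conjecture: under a causal mask with roughly uniform attention, the state at position t is the mean of its t+1 visible tokens, so its norm decays like 1/sqrt(t+1) and can serve as an implicit absolute-position signal that a probe could recover. The uniform-attention assumption and all variable names are illustrative.

```python
# Illustrative sketch: how a causal attention pattern can leak absolute
# position even when no positional embeddings are added.
# Assumption: near-uniform attention over the causal prefix, so position t
# averages t+1 i.i.d. token embeddings; the norm of that average shrinks
# roughly as 1/sqrt(t+1), which correlates with the token's position.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 128, 64

# Random "token embeddings" with no positional information added.
x = rng.standard_normal((seq_len, d_model))

# Uniform causal attention: position t attends equally to tokens 0..t.
mask = np.tril(np.ones((seq_len, seq_len)))
attn = mask / mask.sum(axis=1, keepdims=True)
h = attn @ x  # (seq_len, d_model) hidden states after one attention pass

# The norm of the averaged state decays ~ 1/sqrt(t+1), tracking position.
norms = np.linalg.norm(h, axis=1)
for t in [0, 1, 7, 31, 127]:
    print(f"pos {t:3d}: ||h_t|| = {norms[t]:.3f}")
```

In a trained model the attention weights are not uniform, but the same mechanism applies: each position can only aggregate over its prefix, so statistics of the hidden state depend on how many predecessors were visible, which is exactly the absolute position.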
Community
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- How Effective are State Space Models for Machine Translation? (2024)
- Dependency Transformer Grammars: Integrating Dependency Structures into Transformer Language Models (2024)
- Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell (2024)
- T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings (2024)
- Understanding and Mitigating Tokenization Bias in Language Models (2024)