Context-aware Biases for Length Extrapolation
The source code for "Context-aware Biases for Length Extrapolation" (Cable).
News
- [2025.02.3] Code release
Upcoming
- Cleaning codebase
- Adding scripts for training ALiBi, RoPE, T5-bias
Datasets and Models
Download the datasets from HuggingFace and use dataset_preparation.py to tokenize the dataset and save it to disk.
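The exact interface of dataset_preparation.py may differ; the sketch below only illustrates the tokenize-and-save step, assuming HuggingFace datasets and the GPT-2 tokenizer, with WikiText-103 as an example.

```python
# Illustrative sketch only; the flags and output format of the repo's
# dataset_preparation.py may differ.
from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize(batch):
    # Append EOS so documents remain separable after concatenation.
    return tokenizer([t + tokenizer.eos_token for t in batch["text"]])

raw = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Save the tokenized dataset for the training script to load.
tokenized.save_to_disk("path/to/tokenized_wikitext103")
```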
Some of the trained models:
Dataset | Model | Parameters | Sequence Length | Checkpoint |
---|---|---|---|---|
Fineweb-Edu(10B) | GPT-Medium | 334M | 1024 | Link |
Fineweb-Edu(10B) | GPT-Medium | 334M | 512 | Link |
WikiText-103 | GPT-Tiny | 44M | 1024 | Link |
WikiText-103 | GPT-Tiny | 44M | 512 | Link |
Training
Single GPU
python Cable.py --dataset-dir "path to dataset" --model "medium or small or tiny" --save-dir "dir for logs"
Multiple GPUs
torchrun --standalone --nproc_per_node=2 Cable.py
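The single-GPU flags presumably carry over to the distributed launch, for example:
torchrun --standalone --nproc_per_node=2 Cable.py --dataset-dir "path to dataset" --model "medium or small or tiny" --save-dir "dir for logs"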
For the HellaSwag benchmark and for evaluating length extrapolation, please use the evaluation.ipynb notebook.
Length Extrapolation
A Cable model trained on T=1024 can extrapolate to T=8192, achieving better perplexity (PPL = 22.22) than a sinusoidal model trained directly on T=8192 (PPL = 22.81).
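The evaluation.ipynb notebook is the authoritative reference for this measurement; the sketch below only illustrates the general recipe of scoring non-overlapping windows at a context longer than the training length, assuming a HuggingFace-style causal LM that returns `.logits`.

```python
# Minimal perplexity-at-long-context sketch (illustrative, not the repo's code).
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, token_ids, block_size=8192, device="cuda"):
    """Score 1D token_ids in non-overlapping windows of length block_size."""
    model.eval().to(device)
    total_loss, total_tokens = 0.0, 0
    for start in range(0, token_ids.numel() - 1, block_size):
        chunk = token_ids[start:start + block_size + 1].to(device)
        if chunk.numel() < 2:
            break
        inputs, targets = chunk[:-1].unsqueeze(0), chunk[1:].unsqueeze(0)
        logits = model(inputs).logits  # assumes an output object with .logits
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1), reduction="sum"
        )
        total_loss += loss.item()
        total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)
```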
Runtime and Memory Overhead
Cable significantly improves the model's extrapolation ability with a negligible time and memory overhead compared to the vanilla transformer. Furthermore, compared to existing RPE methods, our approach maintains nearly identical training time and GPU memory usage, while its inference overhead is either negligible or comparable, depending on the sequence length.
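To check these overhead figures on your own hardware, a simple profiling loop like the one below can compare per-step time and peak GPU memory between two models; the model construction and forward signature are assumptions, not the repo's API.

```python
# Rough per-step time / peak-memory comparison sketch (assumed model interface).
import time
import torch

def profile_training_step(model, batch, steps=20, device="cuda"):
    model.train().to(device)
    batch = batch.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.time()
    for _ in range(steps):
        opt.zero_grad(set_to_none=True)
        logits = model(batch)          # assumes model(ids) -> (B, T, vocab) logits
        loss = logits.float().mean()   # dummy loss: we only time the step
        loss.backward()
        opt.step()
    torch.cuda.synchronize(device)
    step_time = (time.time() - start) / steps
    peak_mem_gb = torch.cuda.max_memory_allocated(device) / 2**30
    return step_time, peak_mem_gb
```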
Citation
If you use this repository in your research or wish to refer to our method, please use the following BibTeX entry:
Acknowledgement
This repo is based on Karpathy/Build-NanoGPT. Thanks for their excellent work.