metadata
license: apache-2.0
datasets:
- togethercomputer/RedPajama-Data-1T
language:
- en
base_model:
- KoboldAI/fairseq-dense-125M
Data Scorer
This model scores data for data selection, as described in the paper Data Selection via Optimal Control for Language Models. To use the model, follow the instructions here.
NOTE: You may need to download the base model (KoboldAI/fairseq-dense-125M) to ${PATH_TO_DATA_SELECTION_REPO}/checkpoints/fairseq/125M
so that the tokenizer and config.json for the base model are available.
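One possible way to fetch the base model into that directory is sketched below. It assumes the `huggingface_hub` package is installed and that `PATH_TO_DATA_SELECTION_REPO` points at your local checkout of the data selection repository. The helper names `base_model_dir` and `download_base_model` are hypothetical and are not part of that repository.

```python
from pathlib import Path

# Base model listed in this card's metadata.
BASE_MODEL_ID = "KoboldAI/fairseq-dense-125M"


def base_model_dir(repo_root):
    """Return the checkpoint directory named in the note above (hypothetical helper)."""
    return Path(repo_root) / "checkpoints" / "fairseq" / "125M"


def download_base_model(repo_root):
    """Download the base model files into the expected directory (hypothetical helper)."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = base_model_dir(repo_root)
    target.mkdir(parents=True, exist_ok=True)
    # Fetches the repository contents, including the tokenizer files and config.json.
    snapshot_download(repo_id=BASE_MODEL_ID, local_dir=str(target))
    return target


# Example usage (requires network access):
#   import os
#   download_base_model(os.environ["PATH_TO_DATA_SELECTION_REPO"])
```

This downloads the full model snapshot; if only the tokenizer and config.json are needed, the download could likely be narrowed, but which files are strictly required is not stated here.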
Citation
@article{gu2024data,
  title={Data Selection via Optimal Control for Language Models},
  author={Gu, Yuxian and Dong, Li and Wang, Hongning and Hao, Yaru and Dong, Qingxiu and Wei, Furu and Huang, Minlie},
  journal={arXiv preprint arXiv:2410.07064},
  year={2024}
}