metadata

license: apache-2.0
datasets:
  - togethercomputer/RedPajama-Data-1T
language:
  - en
base_model:
  - KoboldAI/fairseq-dense-125M

Data Scorer

This model scores data for data selection, as described in the paper Data Selection via Optimal Control for Language Models. To use the model, follow the instructions here.

NOTE: You may need to download the fairseq-125M base model to ${PATH_TO_DATA_SELECTION_REPO}/checkpoints/fairseq/125M to obtain the tokenizer and config.json for the base model.
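
One possible way to fetch the base model into the directory named in the note is sketched below. It assumes that "fairseq-125M" refers to the `KoboldAI/fairseq-dense-125M` repository listed as `base_model` in the metadata, that the `huggingface_hub` package is installed, and that `PATH_TO_DATA_SELECTION_REPO` is set as an environment variable. The helper names `base_model_dir` and `download_base_model` are hypothetical and not part of the data selection repository.

```python
import os
from pathlib import Path

# Assumption: the base model named in the note is the one listed in the metadata.
BASE_MODEL_REPO = "KoboldAI/fairseq-dense-125M"


def base_model_dir(repo_root):
    """Return the checkpoint directory named in the note above."""
    return Path(repo_root) / "checkpoints" / "fairseq" / "125M"


def download_base_model(repo_root):
    """Download the base model files (tokenizer, config.json, weights)."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = base_model_dir(repo_root)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=BASE_MODEL_REPO, local_dir=str(target))
    return target


if __name__ == "__main__":
    # Assumes PATH_TO_DATA_SELECTION_REPO points at the cloned repository.
    print(download_base_model(os.environ["PATH_TO_DATA_SELECTION_REPO"]))
```

If only the tokenizer and config.json are needed, `snapshot_download` also accepts an `allow_patterns` argument to restrict which files are fetched; which files the scorer actually requires is not stated here, so the sketch downloads the full repository.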

Citation

@article{gu2024data,
  title={Data Selection via Optimal Control for Language Models},
  author={Gu, Yuxian and Dong, Li and Wang, Hongning and Hao, Yaru and Dong, Qingxiu and Wei, Furu and Huang, Minlie},
  journal={arXiv preprint arXiv:2410.07064},
  year={2024}
}