---
license: mit
datasets:
  - sail/regmix-data
  - sail/regmix-data-sample
language:
  - en
tags:
  - regmix
---

Models Trained with Pile-CC Only

This is a collection of language models trained on Pile-CC only, each with approximately 1B parameters and trained with a different random seed. This project aims to validate the generalization capability of the RegMix approach (https://huggingface.co./papers/2407.01492) from small-scale (e.g., 1M-parameter) to large-scale (e.g., 1B-parameter) models.

Key Features

  • Model Size: 5 separate models, each with ~1B parameters, trained with different random seeds
  • Training Data: the Pile-CC-only data mixture from the RegMix-Data dataset

Dataset

The models were trained using the RegMix-Data dataset, which splits The Pile into its separate domains.
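
If you want to inspect the training data, a minimal sketch using the Hugging Face datasets library is below; the split name and streaming setup are assumptions, so check the dataset card for the actual configuration.

from datasets import load_dataset

# Stream the corpus to avoid downloading everything up front.
# The "train" split name is an assumption; see the dataset card.
dataset = load_dataset("sail/regmix-data-sample", split="train", streaming=True)

for example in dataset:
    print(example)  # inspect the fields of the first record
    break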

Training Hyperparameters

| Hyperparameter         | Value     |
|------------------------|-----------|
| Batch Size             | 1M tokens |
| Learning Rate          | 4e-4      |
| Minimum Learning Rate  | 1e-5      |
| Learning Rate Schedule | Cosine    |
| Warmup Ratio           | 4%        |
| Total Tokens           | 25B       |
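
These numbers pin down the schedule completely: 25B tokens at 1M tokens per batch gives 25,000 steps, and a 4% warmup ratio gives 1,000 warmup steps. The sketch below reconstructs the resulting learning-rate curve for illustration; it is not the authors' training code.

import math

MAX_LR, MIN_LR = 4e-4, 1e-5
TOTAL_STEPS = 25_000                    # 25B tokens / 1M tokens per batch
WARMUP_STEPS = int(0.04 * TOTAL_STEPS)  # 4% warmup ratio = 1,000 steps

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))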

How to Load a Model

You can load any model using the corresponding branch with the Hugging Face Transformers library:

from transformers import AutoModel, AutoTokenizer

# Each seed variant lives on its own branch, selected via the revision argument.
model = AutoModel.from_pretrained("sail/data-mixture-pile-cc-1b", revision="seed-1")
tokenizer = AutoTokenizer.from_pretrained("sail/data-mixture-pile-cc-1b", revision="seed-1")
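
Since these are LLaMA-style causal language models, you can also load them with a language-modeling head for text generation. A minimal sketch (the prompt is arbitrary):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sail/data-mixture-pile-cc-1b", revision="seed-1")
model = AutoModelForCausalLM.from_pretrained("sail/data-mixture-pile-cc-1b", revision="seed-1")

inputs = tokenizer("The Pile is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))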

Data Mixture

The specific data mixture used to train this 1B model is as follows; it can also be found in our code:

train:
  train_the_pile_pile_cc: 1.0
valid:
  valid_the_pile_pile_cc: 1.0
model_name: tinyllama_1_1b
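
For illustration, a short sketch of reading such a mixture config and turning the domain weights into sampling probabilities, assuming PyYAML; the field names follow the snippet above:

import yaml

config_text = """
train:
  train_the_pile_pile_cc: 1.0
valid:
  valid_the_pile_pile_cc: 1.0
model_name: tinyllama_1_1b
"""

config = yaml.safe_load(config_text)
weights = config["train"]
total = sum(weights.values())
# Normalize so the sampling probabilities sum to 1 (trivial here, since
# Pile-CC is the only domain and its weight is 1.0).
probs = {domain: weight / total for domain, weight in weights.items()}
print(probs)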

Model Variants

To access a different model variant, simply change the revision parameter in the from_pretrained method to the desired seed (e.g., "seed-2", "seed-3"); the maximum seed is 5.
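
For example, a short sketch that loads all five seed variants in turn:

from transformers import AutoModel

for seed in range(1, 6):
    # Branches seed-1 through seed-5 hold the five variants.
    model = AutoModel.from_pretrained(
        "sail/data-mixture-pile-cc-1b",
        revision=f"seed-{seed}",
    )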

Model Performance

We evaluated each model using lm-evaluation-harness. The performance metric for each task is the average of the 0-shot to 5-shot acc_norm (normalized accuracy, where available) or acc (accuracy) scores; a reproduction sketch follows the table below.

| Seed | PIQA  | LAMBADA | MultiRC | LogiQA | SocialIQA | Winogrande | RACE  | OpenBookQA | COPA  | HellaSwag | SciQ  | ARC Easy | QQP   | Average |
|------|-------|---------|---------|--------|-----------|------------|-------|------------|-------|-----------|-------|----------|-------|---------|
| 1    | 69.23 | 33.16   | 50.33   | 27.57  | 33.22     | 52.10      | 31.80 | 31.07      | 65.83 | 44.15     | 81.77 | 51.80    | 57.04 | 48.39   |
| 2    | 68.62 | 33.69   | 53.15   | 25.13  | 32.96     | 51.24      | 31.06 | 30.84      | 69.80 | 43.28     | 83.18 | 52.00    | 58.06 | 48.69   |
| 3    | 69.04 | 35.68   | 52.38   | 26.36  | 33.45     | 51.95      | 30.83 | 30.16      | 66.80 | 42.80     | 83.32 | 51.57    | 57.69 | 48.62   |
| 4    | 69.35 | 33.56   | 50.01   | 26.24  | 33.62     | 50.99      | 31.81 | 30.44      | 65.60 | 43.00     | 83.00 | 52.33    | 56.14 | 48.16   |
| 5    | 67.91 | 35.09   | 49.93   | 27.50  | 33.90     | 52.85      | 31.77 | 30.04      | 69.40 | 42.62     | 80.94 | 51.25    | 61.03 | 48.79   |
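
A rough sketch of how such numbers can be reproduced with the lm-evaluation-harness Python API; the exact harness version and task names used by the authors are assumptions (these follow v0.4 conventions), and the task list is truncated for brevity:

import lm_eval

# Evaluate the seed-1 variant on a subset of the tasks above at 0-shot;
# repeat for num_fewshot in range(6) and average to match the table's metric.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sail/data-mixture-pile-cc-1b,revision=seed-1",
    tasks=["piqa", "hellaswag", "winogrande"],
    num_fewshot=0,
)
print(results["results"])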

Usage Notes

  • These models are primarily intended for research purposes.
  • Performance may vary depending on the specific task and domain.

Citation

If you use these models in your research, please cite the RegMix paper:

@article{liu2024regmix,
  title={RegMix: Data Mixture as Regression for Language Model Pre-training},
  author={Liu, Qian and Zheng, Xiaosen and Muennighoff, Niklas and Zeng, Guangtao and Dou, Longxu and Pang, Tianyu and Jiang, Jing and Lin, Min},
  journal={arXiv preprint arXiv:2407.01492},
  year={2024}
}

For more information about the RegMix methodology and its applications, please refer to the original paper.