Datasets for Base and Instruction Models

#1
by meherajj - opened

I noticed from the titles of the models that you’ve used datasets like uonlp-culturax and Bangla 2B+ BERT for base models, and the bangla-alpaca-orca dataset for instruction models. Could you specify the portions of these datasets that were used for training? Also, did you rely solely on the bangla-alpaca-orca dataset for instruction tuning?

Bangla Large Language Model org

I believe, so far, we have only one model that uses Bangla 2B+ BERT; we are primarily focusing on uonlp/CulturaX. For pretraining, I think we used 100% of the bn subset (about 12.4 million text rows). For fine-tuning with bangla-alpaca-orca, we used 5% for validation and the rest for instruction fine-tuning/training.
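
For reference, a minimal sketch of how these splits might be reproduced with the `datasets` library. The repo id `BanglaLLM/bangla-alpaca-orca` and the seed are assumptions, not details confirmed above:

```python
from datasets import load_dataset

# Pretraining corpus: the full Bangla (bn) subset of uonlp/CulturaX
# (~12.4M text rows, per the reply above). CulturaX is gated, so you
# may need to authenticate first (e.g. `huggingface-cli login`).
culturax_bn = load_dataset("uonlp/CulturaX", "bn", split="train")

# Instruction data: bangla-alpaca-orca, split 95/5 into train/validation
# as described above. Repo id and seed are assumptions for illustration.
orca = load_dataset("BanglaLLM/bangla-alpaca-orca", split="train")
splits = orca.train_test_split(test_size=0.05, seed=42)
train_data, val_data = splits["train"], splits["test"]
```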
