Datasets for Base and Instruction Models

#1
by meherajj - opened

I noticed from the titles of the models that you’ve used datasets like uonlp-culturax and Bangla 2B+ BERT for base models, and the bangla-alpaca-orca dataset for instruction models. Could you specify the portions of these datasets that were used for training? Also, did you rely solely on the bangla-alpaca-orca dataset for instruction tuning?

Bangla Large Language Model org

I believe, so far, we have only one model that uses Bangla 2B+ BERT; we are primarily focusing on uonlp/CulturaX. For pretraining, I think we used 100% of the bn subset (about 12.4 million text rows). For fine-tuning with bangla-alpaca-orca, we used 5% for validation and the rest for instruction fine-tuning/training.
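
For reference, a minimal sketch of how these splits might be reproduced with the `datasets` library. The repo id `BanglaLLM/bangla-alpaca-orca` and the seed are assumptions, not details confirmed above:

```python
from datasets import load_dataset

# Pretraining corpus: the full Bangla (bn) subset of uonlp/CulturaX
# (~12.4M text rows, per the reply above). CulturaX is gated, so you
# may need to authenticate first (e.g. `huggingface-cli login`).
culturax_bn = load_dataset("uonlp/CulturaX", "bn", split="train")

# Instruction data: bangla-alpaca-orca, split 95/5 into train/validation
# as described above. Repo id and seed are assumptions for illustration.
orca = load_dataset("BanglaLLM/bangla-alpaca-orca", split="train")
splits = orca.train_test_split(test_size=0.05, seed=42)
train_data, val_data = splits["train"], splits["test"]
```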
