Datasets

#1
by s3nh - opened
Smol Community org

Hello SmolTuners!

As the description says, our main mission is to create 'small LLMs' that are usable for more specific tasks. To do this, we definitely have to focus on datasets that can give us added value. I opened this discussion to gather and note some datasets worth looking at, because they have to be the starting point for fine-tuning (aside from quantization and model merging). Have a great day <3

Smol Community org

https://huggingface.co/datasets/HuggingFaceTB/smoltalk

was used to fine-tune SmolLM2, so it could be worth a look. I'd probably filter it for math, though.
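For anyone who wants to try that, here is a rough sketch of the filtering idea. It assumes each record has a `source` field naming the subset it came from and that math subsets can be spotted by substrings like `math` or `numina`; both are assumptions to check against the dataset card, and `is_math` / `filter_math` are just illustrative helper names. The `keep_math` flag covers both readings of "filter for math" (keep only math, or drop it).

```python
from typing import Iterable, Iterator

# Assumption: substrings that mark math-focused subsets. Adjust these after
# inspecting the source names that actually appear in the dataset.
MATH_MARKERS = ("math", "numina")


def is_math(record: dict, markers: tuple = MATH_MARKERS) -> bool:
    """Return True if the record's (assumed) 'source' field looks math-related."""
    source = str(record.get("source", "")).lower()
    return any(marker in source for marker in markers)


def filter_math(records: Iterable[dict], keep_math: bool = True) -> Iterator[dict]:
    """Yield only math records (keep_math=True) or only non-math records."""
    for record in records:
        if is_math(record) == keep_math:
            yield record


if __name__ == "__main__":
    # Tiny made-up sample, only to show the call pattern.
    sample = [
        {"source": "metamath-50k"},
        {"source": "smol-summarize"},
    ]
    print([r["source"] for r in filter_math(sample)])
```

With the `datasets` library, the same predicate should plug into `Dataset.filter`, e.g. `load_dataset("HuggingFaceTB/smoltalk", "all", split="train").filter(is_math)`, assuming the `all` config and the `source` column are present as described on the dataset card.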

Smol Community org

One thing I've noticed from using a lot of smaller models is that, more often than not, new pretrains of smaller models are not the way to go.

Instead, it's better to fine-tune distilled models such as nvidia/Llama-3.1-Minitron-4B-Width-Base or google/gemma-2-2b-it.

Smol Community org

I'll have some writeups on the way. I'll post an update this evening, let's go!
