Table Research Lab


Recent Activity


dreamerdeo posted an update 9 days ago
🚀 Excited to share our technical report on the Southeast Asian multilingual model Sailor2 and its latest updates!

Our 49-page report details Sailor2's development journey, including multilingual data cleaning, small-model data-mixture simulations, multi-stage continual pre-training, multi-stage post-training, and multicultural, multilingual evaluation. Sailor2 aims to streamline multilingual model pre-training for the community.

🧭 We highlight Sailor2's strong performance in low-resource-language translation and its cultural-understanding advantages in Southeast Asia, promoting practical applications for regional languages.

Model updates include:
💡 More precise outputs: reduced redundancy in model outputs through refined post-training data and optimization techniques.
🌈 Longer context: expanded to a 128K context length for Southeast Asian languages through long-text training.
⚡️ Faster inference: 2.5x faster inference via speculative decoding (a minimal sketch follows this list).
🌪️ More model sizes: new 3B and 14B variants obtained through model pruning.

🌟 All models are Apache-licensed for commercial use; the development tools (code, resources) are open-source.

📚 Technical report: Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs (2502.12982)
🤖 Models: sail/sailor2-language-models-674d7c9e6b4dbbd9a869906b
💬 Demo: sail/Sailor2-20B-Chat (a minimal usage sketch follows this list)
📣 Sailor2 community: https://huggingface.co./sailor2
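If you want to try the demo checkpoint locally, here is a short sketch using standard transformers chat-template usage; the example prompt and generation settings are assumptions, not instructions from the post.

```python
# Minimal sketch: chatting with the Sailor2 demo checkpoint via transformers.
# Prompt and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-20B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Halo! Bisakah kamu memperkenalkan dirimu dalam bahasa Indonesia?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens after the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```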
SivilTaram posted an update 8 months ago
Still following your human intuition to mix corpora from different sources for pre-training 🧠? Everyone says that data mixture has a big impact on model performance, but how, and why 🕵️? Did you know that web corpora are actually highly impactful for downstream tasks 🏆?

Check out our preprint "RegMix: Data Mixture as Regression for Language Model Pre-training" 📄

🔬 In this paper, we propose RegMix, an automatic data-mixture method that achieves a 6.3% improvement over human selection on the widely used HellaSwag benchmark, while requiring only 2% extra training FLOPs! 📈
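For intuition, the core idea can be sketched in a few lines: train cheap proxy models on randomly sampled mixture weights, fit a regression from mixture weights to a proxy metric, then search for the mixture the regression predicts to be best. The toy example below uses synthetic numbers and a plain linear regression as a simplified stand-in for the paper's actual setup.

```python
# Toy sketch of the RegMix idea: regress a proxy metric on data-mixture
# weights, then pick the mixture predicted to be best. All numbers and the
# linear model are illustrative assumptions, not the paper's exact setup.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_domains, n_proxy_runs = 5, 64

# 1) Sample random mixture weights (each row sums to 1) for cheap proxy runs.
mixtures = rng.dirichlet(np.ones(n_domains), size=n_proxy_runs)

# 2) Pretend we trained a tiny model per mixture and measured validation loss.
#    In practice these values come from real small-scale training runs.
true_effect = np.array([-0.8, -0.2, 0.1, -0.5, 0.3])
proxy_loss = 3.0 + mixtures @ true_effect + rng.normal(0, 0.02, n_proxy_runs)

# 3) Fit a regression: mixture weights -> proxy loss.
reg = LinearRegression().fit(mixtures, proxy_loss)

# 4) Search many candidate mixtures and keep the one predicted to be best,
#    which would then be used for the full-scale pre-training run.
candidates = rng.dirichlet(np.ones(n_domains), size=100_000)
best = candidates[np.argmin(reg.predict(candidates))]
print("predicted-best mixture:", np.round(best, 3))
```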

📄 Paper: RegMix: Data Mixture as Regression for Language Model Pre-training (2407.01492)
💻 Code: https://github.com/sail-sg/regmix
📊 Collection: sail/regmix-data-mixture-as-regression-6682b6caab37b9442877f0ce
🎮 Demo: https://huggingface.co./spaces/sail/RegMix